Abstract



The purpose of this dissertation is to introduce a version of Stein's method of ex- 
changeable pairs to solve problems in measure concentration. We specifically target 
systems of dependent random variables, since that is where the power of Stein's 
method is fully realized. Because the theory is quite abstract, we have tried to put 
in as many examples as possible. Some of the highlighted applications are as fol- 
lows: (a) We shall find an easily verifiable condition under which a popular heuristic 
technique originating from physics, known as the "mean field equations" method, is 
valid. No such condition is currently known. (b) We shall present a way of using
couplings to derive concentration inequalities. Although couplings are routinely used
for proving decay of correlations, no method for using couplings to derive
concentration bounds is available in the literature. This will be used to obtain (c)
concentration inequalities with explicit constants under Dobrushin's condition of weak
dependence. (d) We shall give a method for obtaining concentration of Haar measures
using convergence rates of related random walks on groups. Using this technique and 
one of the numerous available results about rates of convergence of random walks, 
we will then prove (e) a quantitative version of Voiculescu's celebrated connection 
between random matrix theory and free probability. 
Chapter 1
Introduction 



The theory of concentration inequalities tries to answer the following question: Given
a random variable $X$ taking values in some measure space $\mathcal{X}$ (which is usually some
high dimensional Euclidean space), and a measurable map $f : \mathcal{X} \to \mathbb{R}$, what is a good
explicit bound on $\mathbb{P}\{|f(X) - \mathbb{E}f(X)| \ge x\}$? Exact evaluation or accurate
approximation is, of course, the central purpose of probability theory itself. In situations where
this is not possible, concentration inequalities aim to do the next best job.

A bound is good if it is sufficiently rapidly decreasing; Gaussian bounds are usually
considered satisfactory. The reasons for insisting on good bounds (as opposed to
Chebyshev type bounds) are theoretical as much as practical. In fact, the theoretical
interest often supersedes the practical aspect, because in spite of all the activity,
concentration bounds often give bad numbers when calculated numerically. Theoretically,
their importance stems, in large part, from the Bonferroni inequality: If we know
that a collection of events $\{A_k\}_{1 \le k \le n}$ are so rare that $\mathbb{P}\{A_k\} \le e^{-cn}$ for each $k$ for
some fixed constant $c$, then $\mathbb{P}\{\cup_k A_k\} \le n e^{-cn}$, which is again "small", since the $e^{-cn}$
term "kills" the $n$. Contrary to what someone unfamiliar with the literature might
feel, this seemingly crude technique has been successful in establishing surprisingly
efficient results, mainly because "rare events are often approximately disjoint".
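
As a quick numerical illustration of why the exponential term dominates, the following sketch evaluates the union bound $n e^{-cn}$ for a few values of $n$. The function name `union_bound` and the value $c = 0.1$ are assumptions made only for this example; they do not come from the text.

```python
import math

def union_bound(n: int, c: float) -> float:
    """Upper bound n * exp(-c * n) on the probability of a union of n events,
    each of which has probability at most exp(-c * n)."""
    return n * math.exp(-c * n)

# c = 0.1 is an arbitrary illustrative constant.
for n in (10, 100, 1000):
    print(n, union_bound(n, 0.1))
```

For small $n$ the bound can exceed 1 and is then vacuous; it becomes useful only once $cn$ is large compared with $\log n$.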

This was at the center of the earliest line of thought about controlling the suprema 
of empirical processes, developed mainly by David Pollard and others in the eighties
(see Pollard [87] for an exposition). Subsequent developments in the nineties,
based on Talagrand's concentration inequalities [107, 109, 110] and the more recent
"entropy method" of Ledoux jHI] and Massart [69], are more subtle. They will be
discussed later in detail. All in all, concentration inequalities form the backbone of 
this very important branch of modern theoretical statistics and machine learning. In 
return, empirical process theory forms the most prominent area of application for 
concentration bounds. 

Concentration inequalities are also used in theoretical computer science, random 
matrix theory and a variety of other fields, for reasons more or less the same as 
mentioned before. Often, applications get obscured because they were not mentioned 
in the abstract or the keywords, and that is a sign that concentration inequalities are 
rapidly attaining the status of standard tools like the Borel-Cantelli lemmas in the 
classical probability literature. There may soon be a time when researchers will start 
using Talagrand's inequalities without explicit reference in the abstract. 

The theory of concentration inequalities for functions of independent random vari- 
ables, some of which was described above, has reached a high level of sophistication 
by now. However, concentration inequalities for functions of dependent random vari- 
ables are still hard to get, the main tools being logarithmic Sobolev and transportation 
cost inequalities. One shortcoming of these methods is that explicit constants are very 
hard or almost impossible to get. We shall discuss these methods in detail in the next 
chapter. 

The main purpose of this thesis is to construct a modification of Stein's method 
of exchangeable pairs, which is a well-known tool from probability theory, to derive 
concentration inequalities with explicit constants for functions of dependent random 
variables. We postpone a discussion of Stein's method until the last section of Chapter 2.

1.1 Summary of thesis 

We now give a brief chapter by chapter description of this thesis. The theory is too 
abstract and requires too much notation to describe in this brief introductory discus- 
sion. Instead, we shall present some theorems which were obtained as applications of 
the theory, for the purpose of enticing the prospective reader to delve deeper. 
In Chapter 2, we shall review the major existing results from the concentration
inequality literature, including the early martingale techniques, Talagrand's inequali- 
ties, and modern entropy based methods. We shall also describe the basic philosophy 
of Stein's method in the last section of this chapter. 

Chapter 3 is the first part of our theory, which is the more "abstract" part. We
shall develop our basic results in this chapter, and apply them to work out some
simple examples involving dependent variables: In section 3.3 we shall obtain a simple
and explicit concentration bound for $m - \tanh(\beta m)$, where $m$ is the magnetization
in the Curie-Weiss model of ferromagnetic interaction. By a generalization of this
example, we shall derive in section 3.4 a broad condition under which the naive
mean field equations (one of the standard tools from physics that is frowned upon
by mathematicians, for good reasons) are valid. In section 3.5 we shall work
out an example about estimation of parameters in statistical models of dependent
data. In section 3.7 we shall apply our techniques to obtain tail bounds of the
correct order for (a) the number of fixed points in a random permutation, and (b)
Spearman's footrule distance between a random permutation and the identity.
Finally, in section 3.10 we shall obtain a concentration result about the
Sherrington-Kirkpatrick model of spin glasses which holds at all temperatures.

In Chapter 4, we shall develop some advanced tools (Lemma 4.1, Lemma 4.2) so
that the theorems from Chapter 3 can be easily applied to more complex problems
than the ones worked out in Chapter 3. Indeed, Lemma 4.2, in conjunction with
Theorem 3.3, is probably the first general tool which allows one to use couplings
to prove concentration inequalities. Couplings are widely used to establish rates of 
temporal and spatial decay of correlations; so it seems natural that there should be 
some result which makes them useful for concentration inequalities, too. 

An application of these tools gives Theorem 4.3, which allows us to derive
concentration inequalities for arbitrary functions in complicated models of dependent
random variables, such as Ising type spin models and random proper colorings of 
finite graphs, under a Dobrushin type condition of weak dependence. We shall now 
state this theorem, after introducing some required notation. 

Let $\Omega$ be a Polish space and let $f : \Omega^n \to \mathbb{R}$ be a function satisfying a Lipschitz
condition with respect to a generalized Hamming distance on $\Omega^n$: For all $x, y \in \Omega^n$,

$$|f(x) - f(y)| \le \sum_{i=1}^n c_i \mathbb{I}\{x_i \ne y_i\} \tag{1.1}$$

where $c_1, \ldots, c_n$ are some fixed nonnegative constants. This just means that the value
of the function does not change by more than $c_i$ if the $i$th coordinate is altered.
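
To make condition (1.1) concrete, here is a minimal brute-force check on a small finite product space. The helper names `hamming_bound` and `satisfies_lipschitz`, and the weighted-sum example, are hypothetical and chosen only for illustration.

```python
from itertools import product

def hamming_bound(x, y, c):
    """Right-hand side of (1.1): the sum of c_i over coordinates where x and y differ."""
    return sum(ci for xi, yi, ci in zip(x, y, c) if xi != yi)

def satisfies_lipschitz(f, c, alphabet):
    """Brute-force check of (1.1) over all pairs of points in alphabet^n, n = len(c)."""
    points = list(product(alphabet, repeat=len(c)))
    return all(abs(f(x) - f(y)) <= hamming_bound(x, y, c) + 1e-12
               for x in points for y in points)

# Hypothetical example: a weighted sum of +/-1 spins.
weights = [0.5, 1.0, 2.0]

def f(x):
    return sum(w * xi for w, xi in zip(weights, x))

c = [2 * w for w in weights]  # flipping spin i changes f by exactly 2 * w_i
print(satisfies_lipschitz(f, c, alphabet=(-1, 1)))
```

Halving the constants (taking `c = weights`) makes the check fail, since flipping a single spin already changes `f` by `2 * w_i`.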

Let $X = (X_1, \ldots, X_n)$ be an $\Omega^n$-valued random variable with law $\mu$. For any
$x \in \Omega^n$, let $\bar{x}^i$ denote the element of $\Omega^{n-1}$ obtained by omitting the $i$th coordinate
of $x$. For each $i \le n$ and $x \in \Omega^n$, let $\mu_i(\cdot \mid \bar{x}^i)$ denote the law of $X_i$ given $\bar{X}^i = \bar{x}^i$.
Finally, let us recall that for a square matrix $A$, the operator norm of $A$ is defined
as:

$$\|A\|_2 := \max_{\|u\|=1} \|Au\|.$$

Then we have the following result from section 4.2:

Theorem 4.3 Suppose $A = (a_{ij})$ is an $n \times n$ matrix with nonnegative entries and
zeros on the diagonal such that for any $i$, and any $x, y \in \Omega^n$,

$$d_{TV}(\mu_i(\cdot \mid \bar{x}^i), \mu_i(\cdot \mid \bar{y}^i)) \le \sum_{j=1}^n a_{ij} \mathbb{I}\{x_j \ne y_j\},$$

where $d_{TV}$ is the total variation distance on the space of probability measures on $\Omega$.
Suppose $f$ satisfies the generalized Lipschitz condition (1.1). If $\|A\|_2 < 1$, we have

$$\mathbb{P}\{|f(X) - \mathbb{E}f(X)| \ge t\} \le 2\exp\left(-\frac{(1 - \|A\|_2)\, t^2}{\sum_i c_i^2}\right)$$

for each $t \ge 0$.
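
The bound in the theorem is fully explicit, so it can be evaluated directly. The sketch below assumes the bound has the form $2\exp(-(1 - \|A\|_2)t^2 / \sum_i c_i^2)$ and takes the operator norm as an input rather than computing it. The function name and the numerical setting are hypothetical.

```python
import math
from typing import Sequence

def dobrushin_tail_bound(norm_A: float, c: Sequence[float], t: float) -> float:
    """Evaluate 2 * exp(-(1 - ||A||_2) * t^2 / sum_i c_i^2), assuming ||A||_2 < 1."""
    if not 0.0 <= norm_A < 1.0:
        raise ValueError("the bound requires 0 <= ||A||_2 < 1")
    return 2.0 * math.exp(-(1.0 - norm_A) * t * t / sum(ci * ci for ci in c))

# Hypothetical setting: the average of n = 100 spins in {-1, +1}, so c_i = 2 / n,
# with an assumed interaction norm ||A||_2 = 0.5.
n = 100
print(dobrushin_tail_bound(0.5, [2.0 / n] * n, t=0.5))
```

The bound degrades as $\|A\|_2$ approaches 1, which matches the requirement that the dependence be weak.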

This is possibly the first result which gives a direct connection between the Dobrushin 
condition [HH] from statistical physics and concentration inequalities. The log-Sobolev 
inequalities of Stroock & Zegarlinski [103, 104, 105] do not give explicit constants,
and it is also not clear whether they extend to systems beyond the lattice. 

Some applications of the above theorem, to spin systems and graph colorings, 
will be worked out in sections 4.3 and 4.4. In particular, we shall work out explicit
concentration bounds for the magnetization in the Ising model on an arbitrary graph
at high temperature. 

In section 4.5 of the same chapter, we shall derive a tool (Theorem 4.6) for
obtaining the concentration of measures which are invariant under group actions. This
is possibly the first result which gives an explicit connection between rates of
convergence to stationarity for random walks on groups and the concentration of Haar
measures. Using this result, we shall obtain the following quantitative bound related
to random matrices and free probability theory, which again, is probably the first
result of its kind. The limiting version of this result is quite well-known; following from
a celebrated result of Voiculescu [114], it says, roughly, that the distribution of the
eigenvalues of $M + N$, where $M$ and $N$ are two high dimensional Hermitian matrices
of the same order, is approximately determined by the eigenvalue distributions of $M$
and $N$.

In the following theorem, the term "empirical distribution function of $H$" is
commonly used random matrix jargon, which just means the probability distribution
function on $\mathbb{R}$ which puts mass $1/n$ on each eigenvalue of the matrix $H$.
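
The definition can be spelled out in a few lines. This is a minimal sketch that takes a list of eigenvalues as given (how they are computed is outside its scope); the function name and the sample spectrum are hypothetical.

```python
from bisect import bisect_right

def empirical_distribution_function(eigenvalues):
    """Return F with F(x) = (number of eigenvalues <= x) / n."""
    sorted_eigs = sorted(eigenvalues)
    n = len(sorted_eigs)

    def F(x):
        # bisect_right counts the entries that are <= x in the sorted list
        return bisect_right(sorted_eigs, x) / n

    return F

# Hypothetical spectrum of a 4 x 4 Hermitian matrix (illustrative values only).
F = empirical_distribution_function([-1.0, 0.5, 0.5, 2.0])
print(F(-2.0), F(0.5), F(3.0))
```

Repeated eigenvalues are counted with multiplicity, so each contributes its own mass of $1/n$.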

Theorem 4.10 Let $A_1$ and $A_2$ be two $n \times n$ real diagonal matrices. Let $U$ and
$V$ be independent Haar distributed random elements of $\mathcal{U}_n$, the group of all unitary
matrices of order $n$. Let

$$H = U A_1 U^* + V A_2 V^*,$$

and let $F_H$ be the empirical distribution function of $H$. Then, for every $x \in \mathbb{R}$,
$\mathrm{Var}(F_H(x)) \le \kappa n^{-1} \log n$, where $\kappa$ is a universal constant not depending on $n$, $A_1$,
$A_2$ or $x$. Moreover we also have the concentration inequality



for every $t \ge 0$, where $\kappa$ is the same as in the variance bound.

We think that it will be very hard to derive such a result using available techniques,
because there is no independence, and Gaussianity is involved in a complicated way,
and the standard concentration results for Gaussian measures are all with respect to
the Euclidean metric, which cannot give a universal bound like the above (bounds, if
any, will involve $A_1$ and $A_2$, and will be inefficient, because the function is badly
nonsmooth). On the other hand, our bound follows quite easily from the theory
developed in this dissertation. We do not know, however, if it is of the correct order
(even after discounting the $\log n$ factor).

Finally, let us mention some of the deficiencies of this dissertation. One shortcom- 
ing is in the range of examples. Although we have tried our best to provide as many 
as we can, it is probably not enough to cover all fields of interest; in particular, we 
have no examples from empirical process theory. Moreover, the author is not com- 
pletely happy with the quality of some of the applications. For instance, the result 
about spin glasses seems to have a wide scope, but the author has been frustrated by 
attempts to exploit it further. Another unfinished aspect is that we could not find 
good examples for the "unbounded differences" theorems of section 3.8.

It is also unfortunate that the technique in section 4.2 has only been applied to
get Theorems 4.3 and 4.6. We feel that the scope of Lemma 4.2, which allows us to
get concentration bounds using couplings, extends much beyond that.

Finally, a major incompleteness, which is a weakness of the idea of concentration 
itself, is the lack of lower bounds. 

We shall attempt to overcome these and other deficiencies in future work. 



Chapter 2 



Review of existing literature 

This chapter will be devoted to the discussion of the main existing tools for proving 
concentration bounds. We apologize in advance for unjust omissions, if any. Most of 
the material in this chapter, except for the very recent developments, is taken from 
the wonderful monograph by Michel Ledoux [61]. In fact, in many places we have
kept the notation, and even the language, intact.

While we shall attempt to give a comprehensive summary of the main theoretical 
results, we shall, in general, refrain from discussing applications in this chapter. The
main reason is that applications invariably entail some lengthy background, the
discussion of which would be a digression from the central theme of this chapter. This
is the same reason why we shall not discuss concentration inequalities in empirical 
process theory. Of course, generous references will be provided. 

In the last section of this chapter, we shall discuss the basics of Stein's method. 

2.1 Hoeffding type inequalities

The Azuma-Hoeffding martingale inequality [52, 7] remained the last word in
concentration for a very long time, culminating in the bounded difference inequality, observed
by Schechtman [HZ] and also by Shamir & Spencer, and its extensive popularity
among combinatorialists and discrete mathematicians following the expository work 
of McDiarmid [HI- 
In its most widely used form, the Azuma-Hoeffding inequality goes as follows:

Theorem 2.1 [Azuma-Hoeffding inequality E]] Let $\{X_i\}_{1 \le i \le n}$ be a martingale
difference sequence adapted to some filtration. Suppose $c_1, \ldots, c_n$ are constants such
that $|X_i| \le c_i$ almost surely for each $i$. Then for all $t \ge 0$ we have
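
The displayed inequality of this theorem is not legible in the present text. For orientation only, a standard form of the Azuma-Hoeffding inequality under exactly these hypotheses is recorded below; the author's display may differ, for instance in being two-sided or in the constant.

```latex
\mathbb{P}\Bigl\{\sum_{i=1}^n X_i \ge t\Bigr\}
  \le \exp\Bigl(-\frac{t^2}{2\sum_{i=1}^n c_i^2}\Bigr),
\qquad
\mathbb{P}\Bigl\{\Bigl|\sum_{i=1}^n X_i\Bigr| \ge t\Bigr\}
  \le 2\exp\Bigl(-\frac{t^2}{2\sum_{i=1}^n c_i^2}\Bigr).
```

The second inequality follows from the first by applying it to the martingale difference sequence $\{-X_i\}$ and adding the two bounds.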



Before stating the bounded difference inequality, we must state the most "accurate" 
version of the Hoeffding inequality for independent summands, which was established 
by Bennett [11].

Theorem 2.2 [Bennett's inequality [11]] Let $Y_1, \ldots, Y_n$ be independent real-valued
random variables bounded in magnitude by some constant $C$. Let $S = \sum_{i=1}^n Y_i$. Then,
for every $t \ge 0$,



where $h(u) = (1 + u)\log(1 + u) - u$, and $\sigma^2 = \sum_{i=1}^n \mathbb{E}(Y_i^2)$.

Originally proved by Bernstein for Bernoulli random variables, this is also sometimes 
called Bernstein's inequality. 

The bounded difference inequality, stated in its simplest form, is the following 
powerful corollary of the Azuma-Hoeffding inequality: 

Theorem 2.3 [Bounded difference inequality 1^111^111111] Let $Z = f(X_1, \ldots, X_n)$ be
a function of independent random variables $X_1, \ldots, X_n$. Let $X_i'$ be an independent
copy of $X_i$, $i = 1, \ldots, n$. Suppose $c_1, \ldots, c_n$ are constants such that for each $i$,

$$|f(X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n) - f(X_1, \ldots, X_n)| \le c_i \quad \text{a.s.}$$

Then, for any $t \ge 0$ we have

$$\mathbb{P}\{Z - \mathbb{E}(Z) \ge t\} \le \exp\left(-\frac{t^2}{2\sum_{i=1}^n c_i^2}\right).$$
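
As a sanity check, the bound can be compared with an exact tail in the simplest case. The sketch below assumes the right-hand side $\exp(-t^2 / (2\sum_i c_i^2))$, which is the form quoted for this inequality in section 2.2, and takes $Z$ to be the number of heads in $n$ fair coin flips, so that $c_i = 1$. The function names are hypothetical.

```python
import math

def bounded_difference_bound(c, t):
    """exp(-t^2 / (2 * sum_i c_i^2)), the assumed right-hand side of the inequality."""
    return math.exp(-t ** 2 / (2 * sum(ci ** 2 for ci in c)))

def binomial_upper_tail(n, t):
    """Exact P{Z - n/2 >= t} for Z ~ Binomial(n, 1/2)."""
    k_min = math.ceil(n / 2 + t)
    return sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

n = 20
for t in (2, 4, 6):
    # Changing one coin flip changes the number of heads by at most 1, so c_i = 1.
    print(t, binomial_upper_tail(n, t), bounded_difference_bound([1] * n, t))
```

The exact tail is much smaller than the bound here, which illustrates the remark that such bounds often give poor numbers even when they have the right qualitative shape.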
The above inequality, though widely useful, does not in general convey the theoretical
essence of the situation, because the bounds on the differences do not reflect the 
typical size of the differences in many interesting problems. A better result, from 
the conceptual point of view, is the following inequality, first discovered by Efron 
and Stein jlU], which has since come to be known as the Efron-Stein inequality. The 
present version is due to Steele [100]:

Theorem 2.4 [Efron-Stein inequality [101 llUOj ] Keeping the notation exactly as in 
the previous theorem, we have 

$$\mathrm{Var}(Z) \le \frac{1}{2} \sum_{i=1}^n \mathbb{E}\left[\left(f(X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n) - f(X_1, \ldots, X_n)\right)^2\right].$$
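
For small discrete examples both sides of the Efron-Stein inequality can be computed exactly by enumeration. The sketch below does this for functions of $n$ independent fair bits; the function names are hypothetical and the two test functions (sum and maximum) are chosen only for illustration.

```python
from itertools import product

def variance(f, n):
    """Exact Var f(X) for X uniform on {0, 1}^n."""
    points = list(product((0, 1), repeat=n))
    mean = sum(f(x) for x in points) / len(points)
    return sum((f(x) - mean) ** 2 for x in points) / len(points)

def efron_stein_bound(f, n):
    """Exact (1/2) * sum_i E[(f(X with X_i replaced by X_i') - f(X))^2]."""
    points = list(product((0, 1), repeat=n))
    total = 0.0
    for i in range(n):
        for x in points:
            for b in (0, 1):  # the independent copy X_i' is uniform on {0, 1}
                y = x[:i] + (b,) + x[i + 1:]
                total += (f(y) - f(x)) ** 2
    # each (x, b) pair has probability 1 / (2 * len(points))
    return 0.5 * total / (2 * len(points))

n = 4
print(variance(sum, n), efron_stein_bound(sum, n))  # sum of the bits
print(variance(max, n), efron_stein_bound(max, n))  # maximum of the bits
```

For the sum of independent bits the two sides coincide, while for the maximum the bound is strictly larger than the variance.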

The Efron-Stein inequality has been used to bound variances in many complicated 
problems. For a very recent application, one can see the work of Reitzner on 
random polytopes. However, the inequality is not guaranteed to give a bound of the 
correct order. For instance, the actual variance of the number of $k$-gons in an
Erdős-Rényi random graph is less than the Efron-Stein bound by a factor of $k$ (Cf.
section 6). Also, it is only a variance bound and gives no useful information about 
the tail behavior. 

All of the above results are based on martingale methods. The limitations of mar- 
tingale techniques were gradually recognized in the late eighties, and people started 
looking for alternatives. In recent years, however, there has been some kind of a minor 
resurgence of interest in extending the old martingale arguments. One of the more 
successful efforts has been the so-called "divide and conquer martingale" method of 
Kim & Vu [HE]. We refer to this paper for further references to the current literature
around the martingale method. 

A significant discovery was recently made by Boucheron, Lugosi & Massart [19], 
who established an exponential version of the Efron-Stein inequality. In our opinion, 
this work has resulted in the completion of the quest for maximum efficiency in 
concentration via Hoeffding type inequalities. 

The methodology used to prove the exponential Efron-Stein inequality is based on the
"entropy method" introduced by Ledoux [HI] and Massart and generalized by
Boucheron, Lugosi & Massart in [18]. This powerful new method is based on the
modified log Sobolev approach, to be discussed in section 2.3.

Before stating the exponential Efron-Stein inequality, we have to introduce some 
notation. 

Let $X_1, \ldots, X_n$ be independent random variables taking values in a measurable
space $\mathcal{X}$. Denote by $X_1^n$ the vector of these $n$ random variables. Let $f : \mathcal{X}^n \to \mathbb{R}$ be a
measurable map. Let $Z = f(X_1, \ldots, X_n)$. Let $X_1', \ldots, X_n'$ denote independent copies
of $X_1, \ldots, X_n$, and let $Z^{(i)} = f(X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n)$.

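Define the random variables $V_+$ and $V_-$ by

$$V_+ = \mathbb{E}\left[\sum_{i=1}^n (Z - Z^{(i)})^2 \, \mathbb{I}\{Z > Z^{(i)}\} \,\Big|\, X_1^n\right]$$

and

$$V_- = \mathbb{E}\left[\sum_{i=1}^n (Z - Z^{(i)})^2 \, \mathbb{I}\{Z < Z^{(i)}\} \,\Big|\, X_1^n\right].$$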

The following result appears as Theorem 2 in [19]:

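Theorem 2.5 [Exponential Efron-Stein inequality] For all $\theta > 0$ and $\lambda \in (0, 1/\theta)$,

$$\log \mathbb{E}[\exp(\lambda(Z - \mathbb{E}(Z)))] \le \frac{\lambda\theta}{1 - \lambda\theta} \log \mathbb{E}\left[\exp\left(\frac{\lambda V_+}{\theta}\right)\right].$$

On the other hand, we have for all $\theta > 0$ and $\lambda \in (0, 1/\theta)$,

$$\log \mathbb{E}[\exp(-\lambda(Z - \mathbb{E}(Z)))] \le \frac{\lambda\theta}{1 - \lambda\theta} \log \mathbb{E}\left[\exp\left(\frac{\lambda V_-}{\theta}\right)\right].$$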
The paper [19] has several nice applications of this result, including applications
to empirical processes and subgraph counts. In particular, the authors prove that
Talagrand's famous convex distance inequality (to be discussed in the next section)
is a corollary of the exponential Efron-Stein inequality.

The more recent paper [20] on generalized moment inequalities, in the same spirit
as the exponential Efron-Stein inequality, is also worthy of note.
2.2 Talagrand's concentration inequalities

This section is devoted to the discussion of the deep investigation of Michel Talagrand,
which climaxed in a series of intricate and powerful results about concentration of
measure in product spaces [107, 109, 110]. These have since come to be known as
Talagrand's concentration inequalities. Rooted in abstract geometric formulation,
Talagrand's techniques have found wide applications in fields ranging from
combinatorial optimization (e.g. the traveling salesman problem in [107]) to random matrix
theory [SUIIS- 

Before we begin our discussion, let us introduce some basic notation: Throughout, 
we consider a product probability measure $\mu^n$ on a product space $\mathcal{X}^n$. For any vector
$c = (c_1, \ldots, c_n) \in \mathbb{R}_+^n$, the generalized Hamming metric $d_c$ is defined on $\mathcal{X}^n$ as

$$d_c(x, y) := \sum_{i=1}^n c_i \mathbb{I}\{x_i \ne y_i\}. \tag{2.1}$$

Thus, the bounded difference inequality says that if $f : \mathcal{X}^n \to \mathbb{R}$ is a function such that
$|f(x) - f(y)| \le d_c(x, y)$ for all $x$ and $y$, then $\mu^n\{f - \int f \, d\mu^n \ge t\} \le \exp(-t^2/2\|c\|^2)$,
where $\|c\|^2 := \sum_{i=1}^n c_i^2$.

A fundamental weakness of this inequality and its early variants is the following: 
They only allow us to consider functions $f$ which satisfy the Lipschitz condition for
some fixed $c$. It was soon realized that a lot of open questions in measure concentration
were about functions which do not satisfy (2.1), but rather, obey

$$f(x) - f(y) \le \sum_{i=1}^n c_i(x) \mathbb{I}\{x_i \ne y_i\} \tag{2.2}$$

or some variant of this, for some vector field $c : \mathcal{X}^n \to \mathbb{R}_+^n$ with uniformly bounded
norm. (The asymmetry in the above expression is often a help rather than hindrance
in applications.) It is thus desirable to have a Hoeffding type bound for such functions
based on $\sup\{\|c(x)\|^2 : x \in \mathcal{X}^n\}$. Talagrand's famed convex distance inequality is, in
essence, a generalization of this idea in the abstract geometric setting. 
Given a set $A \subset \mathcal{X}^n$, let us define the "convex distance" of a point $x$ to $A$ as

$$D_A(x) := \sup_{\|c\|=1} d_c(x, A),$$

where $d_c(x, A) = \inf_{y \in A} d_c(x, y)$. Note that $D_A(x) = d_{c(x)}(x, A)$ for some $c(x)$
depending on $x$ if the supremum is attained (which is usually the case).

The following version of Talagrand's main result is taken from Ledoux [61],
Theorem 4.6:

Theorem 2.6 [Talagrand's convex distance inequality [107]] For every measurable
non-empty subset $A$ of $\mathcal{X}^n$, and for every product probability measure $\mu^n$ on $\mathcal{X}^n$,



In particular, for every $t \ge 0$,
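
Both displays of this theorem are missing from the present text. The block below is a hedged reconstruction, assuming the statement follows the usual form of Talagrand's convex distance inequality (as in the cited Theorem 4.6 of Ledoux's monograph) and agrees with the constant $1/4$ in the exponent used in the consequences that follow.

```latex
\int_{\mathcal{X}^n} \exp\Bigl(\frac{D_A^2}{4}\Bigr)\, d\mu^n \le \frac{1}{\mu^n(A)},
\qquad\text{and in particular}\qquad
\mu^n\{D_A \ge t\} \le \frac{1}{\mu^n(A)}\, e^{-t^2/4}
\quad (t \ge 0).
```

The second statement follows from the first by Markov's inequality applied to $\exp(D_A^2/4)$.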



To connect this with concentration problems for specific functions, the usual route is 
the following: Given a function $f$ with median $m_f$ (that is, $\mu^n\{f \ge m_f\} \ge 1/2$ and
$\mu^n\{f \le m_f\} \ge 1/2$), let $A = \{x : f(x) \le m_f\}$. Then $\mu^n(A) \ge 1/2$. Next, find a
function $r(t)$ such that $f(x) - m_f \ge t$ implies $D_A(x) \ge r(t)$. It is then easy to see
that

$$\mu^n\{f - m_f \ge t\} \le \mu^n\{D_A \ge r(t)\} \le 2e^{-r(t)^2/4}.$$

The abstract formulation allows us to go beyond Lipschitz functions, in the following 
way (which is also the usual method of applying the convex distance inequality): 
Given a function $f$ with median $m_f$, we have to find a function $r(t)$ having the
following property: Whenever $f(x) \ge m_f + t$ and $f(y) \le m_f$, there exists some
$c = c(x) \in \mathbb{R}_+^n$ such that $\|c\| = 1$ and $\sum_{i=1}^n c_i(x) \mathbb{I}\{x_i \ne y_i\} \ge r(t)$. In the particular
case when $r(t) = t$, we get the condition $f(x) - f(y) \le \sum_{i=1}^n c_i(x) \mathbb{I}\{x_i \ne y_i\}$, which
was stated as the initial motivation for this discussion. However, we may have $r(t) \ne t$
also, as demonstrated in Talagrand's bound for the concentration of the longest
increasing subsequence in a random permutation ([107], section 7.1).

As is usual in geometric measure concentration (to be discussed in section 2.5),
Talagrand's inequality gives concentration around the median rather than the mean, 
but that is a minor inconvenience. 

A very important and striking consequence of Theorem 2.6 is the following:

Corollary 2.7 [Talagrand [107]] For every product probability $\mu^n$ on $[0, 1]^n$, every
convex 1-Lipschitz function $f$ on $\mathbb{R}^n$, and every $t \ge 0$,

$$\mu^n\{|f - m_f| \ge t\} \le 4e^{-t^2/4},$$

where $m_f$ is the median of $f$ for $\mu^n$.

The applications of Talagrand's convex distance inequality are too diverse to sum- 
marize in a few paragraphs. Many of the most striking examples were worked out by 
Talagrand himself in his landmark paper [107], including applications to the traveling
salesman problem and first passage percolation. Applications to the concentration
of spectral measures of random matrices were made by Guionnet & Zeitouni jSD].
Concentration of individual eigenvalues of Gaussian random matrices was established
using Theorem 2.6 in a short but remarkable paper by Alon, Krivelevich & Vu.

Another important result of Talagrand is the "control by several points" method, 
which we shall not discuss here. For a description of this technique, one can look at 
Talagrand's original paper [107] or section 4.3 in Ledoux's book [61]. Talagrand used
this method to prove his celebrated Bernstein type bound for empirical processes.
This result is presented in a nice way as Theorem 7.4 in [61].

2.3 Logarithmic Sobolev inequalities 

Logarithmic Sobolev inequalities were introduced by Gross |1H] as the infinitesimal 
version of hypercontractivity in quantum field theory. They soon became a tool of 
fundamental importance, and the last three decades have seen vigorous activity sur- 
rounding the theory and applications of log-Sobolev inequalities. It is not our purpose 
to go deeply into all that; we shall be only concerned with their relevance in
measure concentration. For other applications in probability and statistical mechanics,
one can look at the lecture notes by Guionnet and Zegarlinski, which concentrate
on physical applications, and the paper by Diaconis and Saloff-Coste on appli- 
cations of log-Sobolev inequalities to finite Markov chains. An excellent survey of 
known mathematical results can be found in 

Suppose $X_0, X_1, \ldots$ is a stationary reversible Markov chain on some space $\mathcal{X}$. The
Dirichlet form corresponding to this chain (or more appropriately, corresponding to
the associated kernel) is a functional $\mathcal{E}$ defined on the space of all pairs $f, g$ of maps
from $\mathcal{X}$ into $\mathbb{R}$ which satisfy $\mathbb{E}(f(X_0)^2) < \infty$ and $\mathbb{E}(g(X_0)^2) < \infty$. It is defined as

$$\mathcal{E}(f, g) := \frac{1}{2} \mathbb{E}[(f(X_1) - f(X_0))(g(X_1) - g(X_0))].$$
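
For a finite state space the Dirichlet form is a finite double sum. The sketch below assumes the normalization with the factor $1/2$, so that $\mathcal{E}(f, g) = \frac{1}{2}\sum_{x,y} \pi(x) P(x, y)(f(y) - f(x))(g(y) - g(x))$, where $\pi$ is the stationary distribution and $P$ the transition matrix. The function name and the two-state chain are hypothetical.

```python
def dirichlet_form(pi, P, f, g):
    """(1/2) * sum over x, y of pi[x] * P[x][y] * (f[y] - f[x]) * (g[y] - g[x])."""
    n = len(pi)
    return 0.5 * sum(pi[x] * P[x][y] * (f[y] - f[x]) * (g[y] - g[x])
                     for x in range(n) for y in range(n))

# Hypothetical two-state chain; pi is stationary and the chain is reversible,
# since pi[0] * P[0][1] == pi[1] * P[1][0] (both equal 0.12).
P = [[0.8, 0.2],
     [0.3, 0.7]]
pi = [0.6, 0.4]
f = [0.0, 1.0]
print(dirichlet_form(pi, P, f, f))
```

Here $\mathcal{E}(f, f)$ equals the stationary probability flow between the two states, because $f$ changes by exactly 1 across the single edge.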

The kernel is said to satisfy a logarithmic Sobolev inequality with constant $c$ if for all
$f$ such that $\mathbb{E}(f(X_0)^2) < \infty$, we have

Often, probability measures have natural reversible kernels associated with them. 
In such cases, instead of saying that the kernel satisfies a log-Sobolev inequality, 
we say that the probability distribution satisfies such an inequality. For instance, a 
distribution on $\mathbb{R}^n$ which has density $p(x)$ with respect to Lebesgue measure is, under
appropriate conditions, the stationary distribution of the Langevin diffusion process
$(X_t)_{t \ge 0}$ with constant volatility matrix $\sqrt{2} I$ and drift $\nabla \log p(x)$. Instead of $X_0$ and
$X_1$ we now take $X_0$ and $X_h$, where $h > 0$, but divide the right hand side by $h$ in
the definition of the log-Sobolev inequality. For the Langevin diffusion under suitable
conditions, if we take $h \downarrow 0$, then $h^{-1}\mathcal{E}(f, f) \to \mathbb{E}(\|\nabla f(X_0)\|^2)$. This observation
motivates the following definition: A probability measure $\mu$ on $\mathbb{R}^n$ is said to satisfy a
logarithmic Sobolev inequality with constant $c$ if for all locally Lipschitz $f$,
To most people, this is the usual definition of a log-Sobolev inequality, but the earlier
form is useful for discrete problems. The left hand side is usually called the entropy
of $f^2$ with respect to $\mu$, and is denoted by $\mathrm{Ent}_\mu(f^2)$.
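
The displayed inequalities in the two definitions above are not legible in the present text. The block below is a hedged reconstruction in the normalization of Ledoux's monograph, chosen so that the standard Gaussian measure has constant $c = 1$, in agreement with the Bakry-Émery criterion stated below. The author's normalization may differ by a constant factor.

```latex
% Continuous form, for a probability measure \mu on \mathbb{R}^n:
\mathrm{Ent}_\mu(f^2)
  := \int f^2 \log f^2 \, d\mu - \int f^2 \, d\mu \, \log \int f^2 \, d\mu
  \le 2c \int \|\nabla f\|^2 \, d\mu .

% Discrete form, for a reversible kernel with Dirichlet form \mathcal{E}:
\mathbb{E}\bigl(f(X_0)^2 \log f(X_0)^2\bigr)
  - \mathbb{E}\bigl(f(X_0)^2\bigr) \log \mathbb{E}\bigl(f(X_0)^2\bigr)
  \le 2c \, \mathcal{E}(f, f) .
```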

The single most important property of log-Sobolev inequalities is perhaps the
tensorizing property: If $\mu_1, \ldots, \mu_n$ are probability measures on $\mathbb{R}$ satisfying log-Sobolev
inequalities with constants $c_1, \ldots, c_n$, then the product measure $\mu_1 \otimes \cdots \otimes \mu_n$ on $\mathbb{R}^n$
satisfies a log-Sobolev inequality with constant $\max_i c_i$.

The connection between logarithmic Sobolev inequalities and concentration was 
made in an unpublished but now famous argument of I. Herbst. The following
theorem, which summarizes the end result, is taken from Ledoux [61], Theorem 5.3:

Theorem 2.8 [Herbst's lemma] Let $\mu$ be a probability measure on $\mathbb{R}^n$ satisfying a
log-Sobolev inequality with constant $c$. Then, every 1-Lipschitz function $f : \mathbb{R}^n \to \mathbb{R}$ is
integrable and for every $t \ge 0$, $\mu\{f \ge \int f \, d\mu + t\} \le e^{-t^2/2c}$.

Ledoux also presents a discrete version of the above result; the following occurs as 
Theorem 5.17 in [61]:

Theorem 2.9 Let $\mu$ be the stationary distribution of a reversible Markov chain $\{X_k\}_{k \ge 0}$
on a countable set $\mathcal{X}$, which satisfies a logarithmic Sobolev inequality with constant $c$.
Then, any $f : \mathcal{X} \to \mathbb{R}$ which satisfies

$$\frac{1}{2} \sup_{x \in \mathcal{X}} \mathbb{E}((f(X_1) - f(X_0))^2 \mid X_0 = x) \le 1,$$

also satisfies $\mu\{f \ge \int f \, d\mu + t\} \le e^{-t^2/4c}$ for any $t \ge 0$.

One class of measures for which explicit log-Sobolev constants (and hence, concentration
inequalities) are easily available are high dimensional probability measures with
strictly log-concave density with respect to Lebesgue measure; this is a consequence
of an important work of Bakry and Émery.

Theorem 2.10 [Bakry-Émery criterion] Suppose $\mu$ is a measure on $\mathbb{R}^n$ with density
$e^{-U(x)}$ with respect to Lebesgue measure, where $\mathrm{Hess}\, U(x) \ge aI$ for some fixed constant
$a > 0$, for all $x \in \mathbb{R}^n$. Then $\mu$ satisfies a log-Sobolev inequality with constant $1/a$.
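
A simple check of how the criterion combines with Herbst's lemma: for the standard Gaussian measure on $\mathbb{R}$, $U(x) = x^2/2$ up to an additive constant, so $a = 1$ and the log-Sobolev constant is $c = 1$. The sketch below assumes the Herbst bound in the form $e^{-t^2/2c}$ and compares it with the exact Gaussian tail for the 1-Lipschitz function $f(x) = x$. The function names are hypothetical.

```python
import math

def herbst_bound(t, c):
    """exp(-t^2 / (2c)), the assumed right-hand side of the Herbst-type tail bound."""
    return math.exp(-t * t / (2.0 * c))

def gaussian_upper_tail(t):
    """Exact P{X >= t} for a standard normal X."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

# Standard Gaussian: Hess U = 1, so a = 1 and c = 1 / a = 1.
for t in (0.5, 1.0, 2.0, 3.0):
    print(t, gaussian_upper_tail(t), herbst_bound(t, c=1.0))
```

The bound has the correct Gaussian decay rate in the exponent and differs from the exact tail only in lower-order factors.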
The result can be supplemented by the observation that if $\mu$ satisfies a logarithmic
Sobolev inequality with constant $c$, then the measure defined by $d\nu = Z^{-1} e^{V} d\mu$ (where
$Z$ is the normalizing constant) satisfies a log-Sobolev inequality with constant $c e^{4\|V\|_\infty}$.
This is the famous perturbation argument of Holley and Stroock jSH]. However, this is
not a very useful result in high dimensions, because $\|V\|_\infty$ blows up as the dimension
increases. 

It is to be noted that we do not necessarily have to go through the Herbst argument 
and Bakry-Emery's result to get the concentration inequality for strictly log-concave 
measures. Ledoux [61], pp. 39-41, has a direct proof.

However, in other cases, like spin systems on a lattice, there is no direct argument 
for concentration inequalities, and the route through the Herbst lemma must be used. 
In their important work [103, 104, 105], Stroock and Zegarlinski established that the
Glauber dynamics associated with Ising and other spin models satisfy log-Sobolev 
inequalities at high temperature, wherever Dobrushin's condition of weak dependence 
holds. (See Georgii Chapter 8, for the definition and examples for Dobrushin's 
condition. For a readable account of the work of Stroock and Zegarlinski, see the 
lecture notes jl^.) 

This implies, by Herbst's argument, that those spin systems satisfy concentration
properties at high temperature. However, since explicit log-Sobolev constants are not 
known, it is not possible to get explicit concentration bounds along those lines. 

This brings us to the point where we can state one of our achievements in this 
dissertation, which we already stated once in the Introduction: In section 4.2 we
shall prove a concentration inequality with explicit constants for systems which satisfy
Dobrushin's condition. We shall apply our result to get explicit concentration bounds
for generalized Lipschitz functions of spins in Ising type models and uniformly chosen
proper $k$-colorings of graphs with maximum degree $< k/2$.

For further information and references about the use of logarithmic Sobolev in- 
equalities in the field of measure concentration, the reader is encouraged to look at 
the comprehensive survey jHOj. For further details about their applications to spin 
systems, as well as some easy proofs of known results, one can look at [U^ . 

2.4 Other entropy based methods 

Ever since the advent of Herbst's argument, entropy-based methods have played an 
increasingly important role in the concentration literature. In fact, at the time of 
writing this thesis, they are probably the most actively used tool in research on 
concentration inequalities. We have already discussed the logarithmic Sobolev approach 
to concentration as developed by Ledoux jHI] and Massart [69]. In this section, we briefly 
describe two other important information theoretic methods, namely, the modified 
log-Sobolev inequalities of Ledoux and the transportation cost inequalities of Marton. 

Modified log-Sobolev inequalities. A probability measure $\mu$ on $\mathbb{R}^n$ is said to 
satisfy a modified logarithmic Sobolev inequality if there is a function $\beta(\rho) \ge 0$ 
on $\mathbb{R}_+$ such that, whenever $\|\nabla f\|_\infty \le \rho$, 

$$\mathrm{Ent}_\mu(e^f) \le \beta(\rho) \int |\nabla f|^2 e^f \, d\mu$$

for all locally absolutely continuous $f$ such that $\int e^f \, d\mu < \infty$. Here 
$\mathrm{Ent}_\mu(g) := \int g \log g \, d\mu - \int g \, d\mu \, \log \int g \, d\mu$ is the entropy functional. 

Modified log-Sobolev inequalities were introduced by Bobkov and Ledoux [12] to 
provide an easier alternative to Talagrand's method for proving exponential 
concentration. They showed that like log-Sobolev inequalities, modified log-Sobolev 
inequalities also tensorize in a certain sense. An important consequence of this is the 
following concentration result for products of measures which satisfy Poincaré 
inequalities (the exponential distribution being the prototype for such measures): 

Theorem 2.11 Suppose $\mu$ is a probability measure on $\mathbb{R}$ satisfying a Poincaré 
inequality with constant $c$; that is, for any locally absolutely continuous $f$, 
$\mathrm{Var}_\mu(f) \le c \int |f'|^2 \, d\mu$. Then, any function $F : \mathbb{R}^n \to \mathbb{R}$ satisfying 

$$\sum_{i=1}^n |\partial_i F|^2 \le a^2 \quad \text{and} \quad \max_{1 \le i \le n} |\partial_i F| \le b$$

$\mu^n$-almost everywhere is integrable with respect to $\mu^n$, and for every $t \ge 0$, 

$$\mu^n\Bigl\{F \ge \int F \, d\mu^n + t\Bigr\} \le \exp\Bigl(-\frac{1}{K} \min\Bigl(\frac{t}{b}, \frac{t^2}{a^2}\Bigr)\Bigr),$$

where $K > 0$ is a constant depending only on $c$. 

This occurs as Corollary 5.15 in [HI]. For more on modified log-Sobolev inequalities, 
one can look at the survey [HU]. Some recent developments using the modified log- 
Sobolev approach (alternatively, the "entropy method" ) have already been discussed 
in section Em 

Transportation cost inequalities. Transportation cost inequalities were intro- 
duced by Marton jHS] as a version of measure concentration which works by investi- 
gating distances between measures. 

Let's begin with some familiar definitions. The informational divergence (or 
Kullback-Leibler divergence, or relative entropy) of a measure $\nu$ with respect to 
another measure $\mu$ on the same space is defined as 

$$D(\nu \| \mu) := \int \log \frac{d\nu}{d\mu} \, d\nu$$

if $\nu$ is absolutely continuous with respect to $\mu$, and $D(\nu \| \mu) := \infty$ otherwise. 

If these measures are probability measures defined on a Polish space $(\mathcal{X}, d)$, then the 
$W_1$ and $W_2$ Wasserstein distances between $\nu$ and $\mu$ are defined as 

$$W_1(\nu, \mu) := \inf_\pi \mathbb{E}_\pi \, d(Y, X) \quad \text{and} \quad W_2(\nu, \mu) := \inf_\pi \bigl[\mathbb{E}_\pi \, d(Y, X)^2\bigr]^{1/2},$$

where $Y$ and $X$ are random variables distributed according to the laws $\nu$ and $\mu$, and 
the infimum is taken over all distributions $\pi$ on $\mathcal{X} \times \mathcal{X}$ that have $\nu$ and $\mu$ as marginals. 

A probability measure $\mu$ on $\mathcal{X}$ is said to satisfy a transportation cost inequality 
with constant $c$, if 

$$W_1(\nu, \mu) \le \sqrt{2c \, D(\nu \| \mu)} \quad \text{for all probability measures } \nu \text{ on } \mathcal{X}.$$

Similarly, $\mu$ satisfies a quadratic transportation cost inequality (or, as Marton prefers 
to call it, a distance-divergence inequality) if 

$$W_2(\nu, \mu) \le \sqrt{2c \, D(\nu \| \mu)} \quad \text{for all probability measures } \nu \text{ on } \mathcal{X}.$$
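
As a concrete illustration of the first definition, take a finite space with the trivial metric $d(x, y) = I\{x \ne y\}$. In that case $W_1$ coincides with the total variation distance, and Pinsker's inequality says precisely that every probability measure satisfies a transportation cost inequality with constant $c = 1/4$. The following Python sketch checks this numerically on randomly generated distributions; the helper names, the five-point space and the random test cases are our own illustrative assumptions and not part of the theory above.

```python
import math
import random


def kl_divergence(nu, mu):
    """D(nu || mu) for finite distributions given as lists; inf if nu is not << mu."""
    total = 0.0
    for p, q in zip(nu, mu):
        if p > 0:
            if q == 0:
                return math.inf
            total += p * math.log(p / q)
    return total


def total_variation(nu, mu):
    """W_1(nu, mu) for the trivial metric d(x, y) = 1{x != y}."""
    return 0.5 * sum(abs(p - q) for p, q in zip(nu, mu))


def random_distribution(k, rng):
    """A random strictly positive probability vector of length k (test data only)."""
    w = [rng.random() + 1e-9 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]


rng = random.Random(0)
c = 0.25  # Pinsker: TV <= sqrt(D / 2) = sqrt(2 * c * D) with c = 1/4
worst_ratio = 0.0
for _ in range(2000):
    mu = random_distribution(5, rng)
    nu = random_distribution(5, rng)
    w1 = total_variation(nu, mu)
    bound = math.sqrt(2 * c * kl_divergence(nu, mu))
    worst_ratio = max(worst_ratio, w1 / bound)

print("largest observed W_1 / sqrt(2cD):", worst_ratio)
```

The printed ratio should never exceed 1. For non-trivial metrics the best constant $c$ depends on the geometry of the space, which is where the real content of such inequalities lies.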



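It is not difficult to deduce from either of the above conditions that for any two 
measurable sets $A, B \subseteq \mathcal{X}$ of positive measure, 

$$d(A, B) \le \sqrt{2c \log \frac{1}{\mu(A)}} + \sqrt{2c \log \frac{1}{\mu(B)}},$$

where $d(A, B) := \inf\{d(x, y) : x \in A, y \in B\}$. Putting $B = \{x : d(x, A) \ge t\}$ in the 
above, we obtain an abstract measure concentration inequality: for $t \ge \sqrt{2c \log(1/\mu(A))}$, 

$$1 - \mu(A_t) \le \exp\Bigl(-\frac{1}{2c}\Bigl(t - \sqrt{2c \log \frac{1}{\mu(A)}}\Bigr)^2\Bigr),$$

where $A_t := \{x : d(x, A) < t\}$. This interpretation of measure concentration was put 
forward by Milman in the late seventies. In particular, it easily implies concentration 
inequalities for Lipschitz functions. This will be discussed in some detail in section 2.5. 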

The above was the original line of argument by Marton connecting transportation 
cost inequalities with concentration of measure. 

Quadratic transportation cost inequalities have been more successful for Euclidean 
spaces, mainly because they have a tensorizing property similar to log-Sobolev 
inequalities, which is not true for transportation cost inequalities based on the $W_1$ 
metric. 

There seems to be a close connection between log-Sobolev and quadratic 
transportation cost inequalities. Indeed, Otto & Villani [80] proved that any measure which 
satisfies a logarithmic Sobolev inequality also satisfies a quadratic transportation cost 
inequality. The converse, however, is still an open question. For more information on 
mass transportation, we refer to the recent treatise [113]. 

One significant success of the transportation cost method has been its application 
to concentration for Markov chains. Originally proved by Marton [66] for contracting 
Markov chains, the results were later extended to Doeblin recurrent chains and 
$\Phi$-mixing processes by Samson jUHl and simultaneously by Marton in a manuscript which 
was not sent for publication. We shall now briefly describe the result for contracting 
Markov chains. 

Let $\mu$ be a probability measure on $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ induced by a Markov chain 
with successive transition kernels $\Pi_i$, $i = 1, \ldots, n$; that is, 

$$d\mu(x_1, \ldots, x_n) = \Pi_1(dx_1) \, \Pi_2(x_1, dx_2) \cdots \Pi_n(x_{n-1}, dx_n).$$

Assume that the Markov chain is contracting, that is, there exists $\rho < 1$ such that for 
each $2 \le i \le n$ and $x, y \in \mathcal{X}_{i-1}$, 

$$d_{TV}(\Pi_i(x, \cdot), \Pi_i(y, \cdot)) \le \rho.$$

Then, the following analog of Theorem 2.6 holds: 

Theorem 2.12 [Marton jHS]] For any measurable nonempty subset $A$ of $\mathcal{X}$, 

$$\int e^{(1-\rho)^2 D_A^2 / 4} \, d\mu \le \frac{1}{\mu(A)}.$$

As in Theorem 2.6, this implies 

$$\mu\{D_A \ge t\} \le \frac{1}{\mu(A)} e^{-(1-\rho)^2 t^2 / 4}.$$

Consequently, if $f : \mathcal{X} \to \mathbb{R}$ is a map with median $m_f$ and $r(t)$ is a function such 
that whenever $f(x) \ge m_f + t$ and $f(y) \le m_f$, there exists a vector $c(x) \in \mathbb{R}^n$ (not 
depending on $y$) with norm 1 such that $f(x) - f(y) \le \sum_{i=1}^n c_i(x) I\{x_i \ne y_i\}$, then 



At this point, we should also mention that there is some recent work of Houdre and 
Tetali [HU ESI concentration for Markov chains using a different approach, which 
we shall not discuss here. 
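
To make the contraction condition above concrete, here is a small Python sketch that computes the smallest admissible $\rho$ for a finite-state transition matrix, namely the maximum total variation distance between two rows. The function names and the three-state lazy random walk are hypothetical choices made only for illustration.

```python
def tv_distance(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def contraction_coefficient(kernel):
    """Smallest rho with d_TV(kernel[x], kernel[y]) <= rho for all states x, y."""
    return max(tv_distance(p, q) for p in kernel for q in kernel)


# Illustrative kernel: lazy nearest-neighbour walk on three states.
lazy_walk = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
rho = contraction_coefficient(lazy_walk)
print("rho =", rho, "contracting:", rho < 1)
```

A chain whose kernels all have coefficient strictly below 1 is contracting in the sense used above; a deterministic kernel such as the identity matrix has coefficient exactly 1 and is not.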

Since the pioneering work of Marton, a significant number of important 
contributions were made by various authors. Talagrand [108] proved, among other things, 
that the gaussian measure on $\mathbb{R}^n$ satisfies a transportation cost inequality. Dembo 
[26] used the transportation cost and other information theoretic methods to rederive 
most of Talagrand's abstract inequalities. The papers by Bobkov & Götze [7], and 
Bobkov, Gentil & Ledoux [1^] are also important. We refer to Ledoux JUT, Chapter 6, 
for details. 

More recently, Marton |H71 IHHj has worked on developing transportation cost 
inequalities for highly dependent systems of random variables that usually occur in 
statistical physics. In such models, the only tractable objects are the conditional 
distributions of small subcollections given the rest. The results in jHTl IHH] hold 
under Dobrushin-Shlosman type contractivity conditions. Dobrushin type conditions 
of weak dependence, which originated in statistical physics, will be discussed in 
section 4.2, where we shall also present a method based on exchangeable pairs for directly 
obtaining concentration inequalities with explicit constants under similar situations. 

2.5 Geometric measure concentration 

This section contains a brief discussion of a closely related topic, called "concentration 
of measure" in geometric analytic circles. It is, in fact, a generalization of the idea of 
concentration inequalities, which has been a topic of interest in geometric functional 
analysis and convex geometry in the last three decades. The idea rests on the 
following basic observation: In high dimensional spaces, certain probability measures, 
including products of well behaved one-dimensional measures, exhibit the 
"concentration property", which means that any set which has measure $\ge 1/2$ "engulfs" most 
of the space when slightly expanded. The idea was pioneered by Paul Lévy, who 
observed this phenomenon for the uniform measure on high dimensional spheres; but 
the connection of his work with concentration inequalities remained obscure for many 
years until the papers by Amir & Milman [3] and Gromov & Milman [47] revived 
mathematical interest in this very fundamental feature of high dimensional measure 
spaces. 

Let us now formalize the notion expressed vaguely in the last paragraph. Given a 
Polish space $(\mathcal{X}, d)$ and a probability measure $\mu$ on $\mathcal{X}$, the "concentration function" 
$\alpha_{(\mathcal{X}, d, \mu)} : [0, \infty) \to [0, 1]$ is defined as: 

$$\alpha_{(\mathcal{X}, d, \mu)}(t) := \sup\{1 - \mu(A_t) : A \subseteq \mathcal{X}, \ \mu(A) \ge 1/2\},$$

where $A_t := \{x \in \mathcal{X} : d(x, A) < t\}$, with $d(x, A) := \inf\{d(x, y) : y \in A\}$, as usual. 
This definition is due to Amir and Milman [3]. A measure is "concentrated" when 
$\alpha$ decreases rapidly as $t$ grows. The connection with concentration inequalities for 
Lipschitz maps comes through the following easy observation: If $f : \mathcal{X} \to \mathbb{R}$ is a 
Lipschitz function with Lipschitz constant $\|f\|_{Lip}$, and $m_f$ is a median of $f$ (that is, 
$\mu\{f \ge m_f\} \ge 1/2$ and $\mu\{f \le m_f\} \ge 1/2$), then 

$$\mu\{|f - m_f| \ge t\} \le 2 \alpha_{(\mathcal{X}, d, \mu)}(t / \|f\|_{Lip}).$$

The advantage of using concentration functions is that they also apply to functions 
which are not necessarily Lipschitz, and indeed, not well-behaved in any classical 
sense, as is demonstrated for instance by Talagrand's treatment of the concentration 
problem for the longest increasing subsequence of a random permutation ([107], 
section 7.1). 
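
On a very small space the concentration function can be computed by brute force straight from the definition. The Python sketch below does this for the uniform measure on the three-dimensional discrete cube with the (unnormalized) Hamming metric; the choice of space, the function names and the evaluation points are our own assumptions, made only to illustrate the definition.

```python
from fractions import Fraction
from itertools import product


def hamming(x, y):
    """Number of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def concentration_function(points, dist, t):
    """alpha(t) = sup{1 - mu(A_t) : mu(A) >= 1/2} for the uniform measure on points."""
    n = len(points)
    best = Fraction(0)
    for mask in range(1, 1 << n):           # every nonempty subset A
        members = [i for i in range(n) if (mask >> i) & 1]
        if 2 * len(members) < n:
            continue                        # keep only sets with mu(A) >= 1/2
        # A_t = {x : d(x, A) < t}
        enlarged = sum(
            1 for x in points
            if min(dist(x, points[a]) for a in members) < t
        )
        best = max(best, 1 - Fraction(enlarged, n))
    return best


cube = list(product((0, 1), repeat=3))
alpha = {t: concentration_function(cube, hamming, t) for t in (0.5, 1.5, 2.5)}
for t, value in alpha.items():
    print(t, value)
```

For this cube the values are 1/2, 1/8 and 0: a Hamming ball of radius one already has measure 1/2, and enlarging it by one more step covers all but a single vertex. The exhaustive search over subsets is exponential in the number of points, so this approach is only feasible for toy examples.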

A major success of the geometric approach to measure concentration came with the 
simple proof of the famous Dvoretzky theorem by Milman jTHj. To state Dvoretzky's 
theorem, we first need a definition: A Banach space $(E, \|\cdot\|)$ is said to contain a 
subspace $(1 + \epsilon)$-isomorphic to $\mathbb{R}^k$ if there are vectors $v_1, \ldots, v_k$ in $E$ such that for all 
$t = (t_1, \ldots, t_k) \in \mathbb{R}^k$, 

$$(1 - \epsilon) \|t\| \le \Bigl\| \sum_{i=1}^k t_i v_i \Bigr\| \le (1 + \epsilon) \|t\|,$$

where $\|t\|$ on the two outer sides denotes the Euclidean norm of $t$. 

Dvoretzky's theorem is the following very important result: 

Theorem 2.13 [Dvoretzky's Theorem j3BJ] For each $\epsilon > 0$ there exists $\eta(\epsilon) > 0$ such 
that every Banach space $E$ of dimension $n$ contains a subspace $(1 + \epsilon)$-isomorphic to 
$\mathbb{R}^k$, where $k = [\eta(\epsilon) \log n]$. 

This result was at the center of the vigorous activity around the local theory of Banach 
spaces in the decades 1970-90. Milman's treatment and subsequent developments 
show that Dvoretzky's theorem can be viewed as a manifestation of the measure 
concentration phenomenon in a certain sense. We refer the interested reader to the 
lecture notes [77] for extensive details. 

For further details about the functional analytic aspect (which is not relevant to 
this dissertation), we refer to Milman [ZEj and Chapters 2 and 3 of Ledoux [F)T] . 

2.6 Concentration on groups 

A group $G$ is called "topological" if it is endowed with a topology which makes the 
group operations continuous. It is a classical result that on any compact topological 
group $G$, there exists a unique probability measure $\mu$ which is left and right invariant; 
that is, if $X$ is a $G$-valued random variable following the law $\mu$, then $xX$ and $Xx$ have 
the same law as $X$ for every $x \in G$. It also follows that $X^{-1}$ has the same law. This 
measure $\mu$ is called the "Haar measure on $G$" (or the "normalized Haar measure" in 
some texts, but Haar measures will always be normalized for us). For a self-contained 
proof of the existence and uniqueness of Haar measures on compact groups, we refer 
to Rudin jHSI, Theorem 5.14. 

There is not much literature on the concentration of Haar measures. Before stating 
whatever little we could find, let us clarify that the works of Gromov & Milman [47] 
and Pestov |5H IH^ are about a different kind of "concentration" on groups, which is 
not related to our investigation. 

One early result is due to Maurey [70], who investigated the Haar measure on the 
group $S_n$ of all permutations of $n$ elements. 

Theorem 2.14 [Maurey [70]] Let $\mu$ denote the uniform probability measure on $S_n$. 
Let $d_n$ denote the normalized Hamming metric on $S_n$: $d_n(\sigma, \tau) = \frac{1}{n} \sum_{i=1}^n I\{\sigma(i) \ne \tau(i)\}$. 
Then, for any $A \subseteq S_n$ such that $\mu(A) \ge 1/2$, and any $t > 0$, we have 
$\mu(A_t) \ge 1 - 2e^{-nt^2/64}$, where $A_t = \{\sigma \in S_n : d_n(\sigma, A) < t\}$, as usual. Consequently, 
for any function $f : S_n \to \mathbb{R}$ which satisfies $|f(\sigma) - f(\tau)| \le d_n(\sigma, \tau)$ for all $\sigma, \tau \in S_n$, 
we have 

$$\mu\{|f - m_f| \ge t\} \le 2e^{-nt^2/64} \quad \text{for all } t \ge 0,$$

where $m_f$ is the median of $f$ with respect to $\mu$. 

Maurey's result was generalized in the lecture notes of Milman and Schechtman ([77], 
Theorem 7.12) using the classical martingale argument to give the following theorem: 

Theorem 2.15 [Milman & Schechtman [77]] Let $G$ be a group, compact with respect 
to a translation invariant metric $d$. Let $G = G_0 \supseteq G_1 \supseteq \cdots \supseteq G_n = \{1\}$ be a 
decreasing sequence of closed subgroups of $G$. Let $a_k$ be the diameter of $G_{k-1}/G_k$, 
$k = 1, \ldots, n$. Let $\mu$ be the Haar measure and let $f : G \to \mathbb{R}$ be a function satisfying 
$|f(x) - f(y)| \le d(x, y)$ for all $x, y \in G$. Then for all $t \ge 0$ 



Moreover, if $A \subseteq G$ is such that $\mu(A) \ge 1/2$, then for all $t > 0$ 



where $A_t := \{x \in G : d(x, A) < t\}$. 

Maurey's theorem may be easily recovered by considering the tower $S_n \supseteq S_{n-1} \supseteq 
\cdots \supseteq S_0 = \{1\}$. Although the theorem looks pretty general, it is not very clear what 
sorts of examples it might cover. In fact, we could not find in the literature any 
interesting application of this result other than the original application to rederive 
Maurey's theorem. Still, we decided to include it in the review because it has the 
looks of a powerful general result which has not yet been fully exploited. 

Next, let us discuss a result of Talagrand about the concentration of the Haar 
measure on $S_n$ from his seminal work [107]. For each $A \subseteq S_n$ and $\sigma \in S_n$, define 

$$U_A(\sigma) := \{s \in \{0, 1\}^n : \text{for some } \tau \in A, \ I\{\tau(i) \ne \sigma(i)\} \le s_i \text{ for each } i\}.$$

Let $V_A(\sigma)$ be the convex hull of $U_A(\sigma)$ in $[0, 1]^n$, and let 

$$f(A, \sigma) := \inf\{\|s\|^2 : s \in V_A(\sigma)\}.$$

Then, Talagrand ([107], Theorem 5.1) has the following result: 

Theorem 2.16 [Talagrand [107]] For every $A \subseteq S_n$, we have 

$$\int_{S_n} e^{f(A, \sigma)/16} \, d\mu(\sigma) \le \frac{1}{\mu(A)},$$

where $\mu$ is the uniform probability on $S_n$. 

As the reader might have observed, it is not clear what is going on. Talagrand does 
not provide any example for the above theorem in his paper, although applications 
should be quite similar to those for product measures, because the setup is more or 
less the same. A workable corollary of this theorem was devised by McDiarmid [7?. 

Finally, let us state a result from Gromov & Milman about the concentration of 
the Haar measure on $SO_n$, the group of $n \times n$ orthogonal matrices with determinant 1. 

Theorem 2.17 [Gromov & Milman [47]] Consider the group $SO_n$ of $n \times n$ orthogonal 
matrices with determinant 1. Let $d$ be the Hilbert-Schmidt metric on $SO_n$, defined as 
$d(A, B) = \bigl(\sum_{i,j} (a_{ij} - b_{ij})^2\bigr)^{1/2}$, where $A = (a_{ij})$ and $B = (b_{ij})$ are any two elements 
of $SO_n$. Let $\mu$ be the Haar measure on $SO_n$. Then, for any set $A \subseteq SO_n$ such that 
$\mu(A) \ge 1/2$, and for any $t \ge 0$, we have 



In particular, if $f : SO_n \to \mathbb{R}$ is a function such that $|f(A) - f(B)| \le d(A, B)$ for all 
$A, B$, then $\mu\{|f - m_f| \ge t\}$ is bounded by a constant multiple of $e^{-(n-1)t^2/8}$ for each 
$t \ge 0$, where $m_f$ is the median of $f$ with respect to $\mu$. 

This result was used by Voiculescu [114] in his seminal work connecting free 
probability theory to random matrices. However, it is not of much use for a quantitative 
analysis of the eigenvalue spectrum. In section 4.6 we shall develop a new method, 
using the rank metric, defined as $d(M, N) = \mathrm{rank}(M - N)$, for analyzing the 
concentration of unitary matrices which will be suitable for studying spectral distributions 
of related random matrices which arise in free probability theory. 

2.7 Stein's method of exchangeable pairs 

Stein's method was introduced by Charles Stein [101] in the context of normal 
approximation for sums of dependent random variables. Stein's version of his method, 
best described as the "method of exchangeable pairs", attained maturity in his later 
work [102]. A reasonably large literature has developed around the subject, but it has 
almost exclusively developed as a method of proving distributional convergence with 
error bounds. Stein's attempts at getting large deviations in [102] did not, 
unfortunately, prove fruitful. The main purpose of this dissertation is to outline a simple 
way of deriving concentration inequalities using the method of exchangeable pairs, 
and applying it to problems involving dependent variables. 

We shall now briefly describe Stein's method for distributional approximation. 
Suppose we want to show that a random variable $X$ taking values in some space 
$\mathcal{X}$ has approximately the same distribution as some other random variable $Z$. The 
procedure involves four steps: 

1. Identify a "characterizing operator" $T$ for $Z$, which has the defining property 
that for any function $g$ belonging to a fixed large class of functions, $\mathbb{E} \, Tg(Z) = 0$. 
For instance, if $\mathcal{X} = \mathbb{R}$ and $Z$ is a standard gaussian random variable, 
then $Tg(x) := g'(x) - xg(x)$ is a characterizing operator, acting on all locally 
absolutely continuous $g$ with moderate growth at infinity. 

2. Construct a random variable $X'$ such that $(X, X')$ is an exchangeable pair. 

3. Find an operator $\alpha$ such that for any suitable $h : \mathcal{X} \to \mathbb{R}$, $\alpha h$ is an antisymmetric 
function (that is, $\alpha h(x, y) = -\alpha h(y, x)$) and 

$$\bigl| \mathbb{E}(\alpha h(X, X') \mid X = x) - Th(x) \bigr| \le \epsilon_h,$$

where $\epsilon_h$ is a small error depending only on $h$. 

4. Take a function $g$ and find $h$ such that $Th(x) = g(x) - \mathbb{E} g(Z)$. By antisymmetry 
of $\alpha h$, it follows that $\mathbb{E}(\alpha h(X, X')) = 0$. Combining with the previous step, we 
get $|\mathbb{E} g(X) - \mathbb{E} g(Z)| \le \epsilon_h$. 

Note that the operator $S$ defined as 

$$Sh(x) := \mathbb{E}(\alpha h(X, X') \mid X = x)$$

is a characterizing operator for the distribution of $X$. Thus, the basic principle of 
Stein's method is to prove the closeness of the distributions of $X$ and $Z$ by showing 
that the characterizing operators are close. 
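
The defining property of the gaussian characterizing operator $Tg(x) = g'(x) - xg(x)$ can be checked exactly on polynomials, since the moments of a standard gaussian are known in closed form. The following Python sketch does so, and also shows that the same expectation fails to vanish for a symmetric $\pm 1$ variable. The function names and the particular test polynomials are hypothetical choices for illustration only.

```python
from fractions import Fraction


def gaussian_moment(k):
    """E[Z^k] for a standard gaussian Z: 0 for odd k, (k - 1)!! for even k."""
    if k % 2 == 1:
        return 0
    result = 1
    for j in range(1, k, 2):
        result *= j
    return result


def rademacher_moment(k):
    """E[X^k] when X is +1 or -1 with probability 1/2 each."""
    return 1 if k % 2 == 0 else 0


def stein_expectation(coeffs, moment):
    """E[g'(X) - X g(X)] for g(x) = sum_k coeffs[k] * x**k, given moment(k) = E[X^k]."""
    total = Fraction(0)
    for k, a in enumerate(coeffs):
        if k >= 1:
            total += a * k * moment(k - 1)   # contribution of g'(X)
        total -= a * moment(k + 1)           # contribution of X g(X)
    return total


cubic = [0, 0, 0, 1]                          # g(x) = x^3
mixed = [2, -1, 0, 5, 7, Fraction(1, 3)]      # an arbitrary test polynomial

print(stein_expectation(cubic, gaussian_moment))    # vanishes for the gaussian
print(stein_expectation(mixed, gaussian_moment))    # vanishes for the gaussian
print(stein_expectation(cubic, rademacher_moment))  # does not vanish
```

The last value equals $3\,\mathbb{E}X^2 - \mathbb{E}X^4 = 2$, so the operator distinguishes the $\pm 1$ law from the gaussian, which is the sense in which $T$ "characterizes" $Z$.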

Stated differently, this can be viewed as the study of stationary distributions of 
reversible Markov chains using their generators, but differences exist. The essential 
difference is that Stein's method involves only the pair {X, X') instead of the whole 
chain. The restriction of attention makes a lot of conceptual and practical difference. 

There are other variants of Stein's method, most notably the dependency graph 
approach popularised by Arratia, Goldstein and Gordon 0, and the size-biased and 
zero-biased couplings of Barbour, Holst & Janson jTU], but we shall not discuss those. 

Stein's method has been successfully used to prove convergence to gaussian and 
Poisson distributions in various situations involving dependent random variables. 
(Poisson approximation by Stein's method was introduced by Chen and became 
popular after the publication of E]-) It is not our purpose here to go deeply into 
the regular versions of Stein's method. For further references and exposition, we refer 
to the recent monograph jSU]. For applications of the method of exchangeable pairs 
and other versions of Stein's method to Poisson approximation, one can look at the 
survey paper by Chatterjee, Diaconis & Meckes. 



Chapter 3 

Theory and examples: Part I 



In this chapter, we shall concentrate on the "abstract" part of our theory. The more 
directly applicable part will be presented in the next chapter. 

Each theorem will be followed by a demonstrative example, which will always be 
the easiest application, involving sums of independent random variables. Let us now 
briefly mention the examples that we shall work out in this chapter. 

In section 3.3 we shall obtain a simple and explicit concentration bound for 
$m - \tanh(\beta m)$, where $m$ is the magnetization in the Curie-Weiss model of ferromagnetic 
interaction (which will be defined in that section for the reader who is not familiar 
with ferromagnetic models). By a generalization of this example, we shall derive 
in section 3.4 a broad condition under which a mathematically dubious but often 
useful technique from physics, known as the naive mean field equations, is valid. In 
section ESI we shall work out an example about estimation in models of dependent 
data (like Markov Random Field models) from mathematical statistics. In section 
3.7 we shall apply our techniques to obtain tail bounds of the correct order for 
(a) the number of fixed points in a random permutation, and (b) the Spearman's 
footrule distance between a random permutation and the identity. Finally, in section 
3.10 we shall obtain a temperature-free concentration result about the 
Sherrington-Kirkpatrick model of spin glasses. 

Throughout the remainder of this thesis, unless otherwise specified, the following 
notation will remain fixed and understood without mention: 

• $\mathcal{X}$ is a Polish space and $X$ is a random variable taking values in $\mathcal{X}$. 

• $f : \mathcal{X} \to \mathbb{R}$ is a measurable map. The object of interest is $f(X)$, and we 
assume, without loss of generality, that $\mathbb{E} f(X) = 0$. (We shall abandon this 
last assumption on certain occasions.) 

• $X'$ is another random variable defined on the same probability space as $X$, such 
that $(X, X')$ is an exchangeable pair. 

• $F : \mathcal{X}^2 \to \mathbb{R}$ is a measurable map such that $F(X, X') = -F(X', X)$ almost 
surely (i.e. $F$ is antisymmetric) and $\mathbb{E}(F(X, X') \mid X) = f(X)$. For now, we shall 
assume that both $F$ and $f$ are known. In Chapter 4 we shall investigate a 
method for obtaining $F$ from a given $f$ using couplings. 

• We associate a function $v$ with $f$, which will serve as a seemingly crude but 
useful "stochastic bound on the size of $f(X)^2$": 

$$v(x) := \tfrac{1}{2} \mathbb{E}\bigl( |(f(X) - f(X')) F(X, X')| \bigm| X = x \bigr). \qquad (3.1)$$

The general principle of Stein's method of exchangeable pairs is to study the behavior 
of $f(X)$ using $F(X, X')$. Exchangeable pairs often have natural constructions, and the 
theory usually gives us ways to infer about $f(X)$ if we know things about $F(X, X')$; 
however, getting information about $F$ is usually the difficult part in practice. As 
mentioned before, we shall deal with that in Chapter 4. 

In the next two sections we are going to present two basic theoretical results. We 
shall work out our first serious example in section 3.3. 

3.1 A variance formula 

In this section, we establish a formula for the variance of /(X). The importance 
of this simple formula lies in the fact that in a sense, it contains the essence of our 
theory; further embellishments are technical. 

We begin with the following fundamental lemma, which is the foundation of all 
further investigation: 

Lemma 3.1 For any measurable $h : \mathcal{X} \to \mathbb{R}$ such that $\mathbb{E}|h(X) F(X, X')| < \infty$, we 
have 

$$\mathbb{E}(h(X) f(X)) = \tfrac{1}{2} \mathbb{E}\bigl((h(X) - h(X')) F(X, X')\bigr).$$

Proof. Note that $\mathbb{E}(h(X) f(X)) = \mathbb{E}(h(X) F(X, X'))$. Using the exchangeability of 
$X$ and $X'$, and the antisymmetric nature of $F$, we have 

$$\mathbb{E}(h(X) F(X, X')) = \mathbb{E}(h(X') F(X', X)) = -\mathbb{E}(h(X') F(X, X')).$$

Thus, 

$$\mathbb{E}(h(X) F(X, X')) = \tfrac{1}{2} \mathbb{E}\bigl((h(X) - h(X')) F(X, X')\bigr).$$

This completes the proof. □ 

An immediate consequence of the above lemma is the following identity, which 
may be viewed as a weak law of large numbers for exchangeable pairs: 

Theorem 3.2 With the same notation as above, we have 

$$\mathrm{Var}(f(X)) = \tfrac{1}{2} \mathbb{E}\bigl((f(X) - f(X')) F(X, X')\bigr)$$

whenever $\mathbb{E}(f(X)^2) < \infty$. 

Proof. Recall that $\mathbb{E}(f(X)) = 0$ by assumption. Thus, we can directly use Lemma 3.1 
with $h = f$. □ 

Remark. At this point, we should mention that Reinert [Sni has a technique for 
proving weak law type results for process valued functions using Stein's method, which 
she used to prove weak convergence of certain types of empirical processes. However, 
Reinert takes the usual Stein's method approach of treating the concentration problem 
as a problem of distributional approximation, where the target distribution is a point 
mass. Proceeding along this route cannot give exponential tail bounds. 

As a second remark, note that from the above formula, we easily get $\mathbb{E}(f(X)^2) \le 
\mathbb{E}(v(X))$, where $v$ is as defined in (3.1). 

Example. To quickly see how this works, let $X = \sum_{i=1}^n Y_i$, where the $Y_i$'s are 
independent square integrable random variables. Let $\mu_i = \mathbb{E}(Y_i)$ and $\sigma_i^2 = \mathrm{Var}(Y_i)$. An 
exchangeable pair is created by choosing a coordinate $I$ uniformly at random from 
$\{1, \ldots, n\}$, and defining 

$$X' = \sum_{j \ne I} Y_j + Y_I',$$

where $Y_1', \ldots, Y_n'$ are independent copies of $Y_1, \ldots, Y_n$. Let $F(x, y) = n(x - y)$. Then 

$$\mathbb{E}(F(X, X') \mid Y_1, \ldots, Y_n) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\bigl(n(Y_i - Y_i') \bigm| Y_1, \ldots, Y_n\bigr) = \sum_{i=1}^n (Y_i - \mu_i).$$

Thus, we have $f(x) = x - \sum_{i=1}^n \mu_i$. The theorem now gives 

$$\mathrm{Var}(X) = \frac{n}{2} \mathbb{E}\bigl((X - X')^2\bigr) = \frac{1}{2} \sum_{i=1}^n \mathbb{E}(Y_i - Y_i')^2 = \sum_{i=1}^n \sigma_i^2.$$

Note that this example was meant to be just a basic illustration. It requires 
independence of the $Y_i$'s, while for the variance formula to hold, we just need them to be 
uncorrelated. Substantial examples will be provided in sections 3.3 and 3.4. 
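
The computation in this example can be verified by exact enumeration. The Python sketch below takes three small discrete laws for $Y_1, Y_2, Y_3$ (hypothetical choices, made up purely for illustration), enumerates every outcome of $(Y_1, \ldots, Y_n, I, Y_I')$, and checks both that $\mathbb{E}(F(X, X') \mid Y_1, \ldots, Y_n) = f(X)$ and that the right hand side of Theorem 3.2 equals $\sum_i \sigma_i^2$.

```python
from fractions import Fraction
from itertools import product

# Illustrative marginal laws: each dict maps a value of Y_i to its probability.
laws = [
    {0: Fraction(1, 2), 1: Fraction(1, 2)},
    {-1: Fraction(1, 3), 2: Fraction(2, 3)},
    {0: Fraction(1, 4), 1: Fraction(1, 4), 5: Fraction(1, 2)},
]
n = len(laws)
means = [sum(v * p for v, p in law.items()) for law in laws]
variances = [
    sum((v - m) ** 2 * p for v, p in law.items()) for law, m in zip(laws, means)
]


def F(x, y):
    """Antisymmetric function F(x, y) = n (x - y) from the example."""
    return n * (x - y)


def f(x):
    """f(x) = x - sum of the means, so that E f(X) = 0."""
    return x - sum(means)


rhs = Fraction(0)             # accumulates (1/2) E((f(X) - f(X')) F(X, X'))
max_cond_error = Fraction(0)  # largest |E(F(X, X') | Y) - f(X)| over all outcomes
for outcome in product(*[law.items() for law in laws]):
    values = [v for v, _ in outcome]
    prob = Fraction(1)
    for _, p in outcome:
        prob *= p
    x = sum(values)
    cond_F = Fraction(0)
    for i in range(n):                        # coordinate I = i, probability 1/n
        for v_new, p_new in laws[i].items():  # independent copy Y_i'
            x_new = x - values[i] + v_new
            weight = Fraction(1, n) * p_new
            cond_F += weight * F(x, x_new)
            rhs += prob * weight * (f(x) - f(x_new)) * F(x, x_new) / 2
    max_cond_error = max(max_cond_error, abs(cond_F - f(x)))

print("sum of variances:", sum(variances))
print("exchangeable pair formula:", rhs)
```

Because everything is computed with exact rational arithmetic, the two printed numbers agree exactly rather than up to rounding.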



3.2 An exponential inequality 

Before working on further examples, let us describe our first concentration result, 
which is an exchangeable pairs version of the classical Hoeffding inequality (Theorem 
2.1). Also, from now on we shall assume that 

$$\mathbb{E}\bigl(e^{\theta f(X)} |F(X, X')|\bigr) < \infty \quad \text{for each } \theta \in \mathbb{R} \qquad (3.2)$$

in all our theorems, with the exception of Theorem 3.14, which is designed for 
situations where the above condition does not hold. The validity of this assumption will 
be trivial to check in most of our examples. 

Theorem 3.3 If $C$ is a constant such that $|v(X)| \le C$ almost surely (where $v(X)$ 
is defined in (3.1)), then we have $\mathbb{E}(e^{\theta f(X)}) \le e^{C\theta^2/2}$ for all $\theta$, and consequently, 
$\mathbb{P}\{|f(X)| \ge t\} \le 2e^{-t^2/2C}$ for each $t \ge 0$. 

Remark. A version of this result about reversible Markov kernels was observed by 
Schmuckenschläger. (Note that an exchangeable pair is formally equivalent to a 
reversible kernel.) However, Schmuckenschläger restricts himself to a very special class 
of kernels (those with "positive curvature" in the sense of Bakry and Emery ^) and 
the resulting technique becomes almost unusable in most practical problems. Indeed, 
the paper j^Hlj, though imaginative in many ways, has few concrete applications. 

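Example. To see that Theorem 3.3 is a generalization of Hoeffding's inequality, 
consider $X = \sum_{i=1}^n Y_i$, where the $Y_i$'s are now independent random variables with 
$\mathbb{E}(Y_i) = \mu_i$ and $|Y_i - \mu_i| \le c_i$ almost surely, where the $\mu_i$'s and $c_i$'s are finite constants. 
Creating the exchangeable pair as in the example following Theorem 3.2 and putting 
$f(X) = X - \mathbb{E}(X)$, we get 

$$v(X) = \frac{1}{2n} \sum_{i=1}^n \mathbb{E}\bigl(n (Y_i - Y_i')^2 \bigm| X\bigr) = \frac{1}{2} \sum_{i=1}^n \mathbb{E}\bigl((Y_i - \mu_i)^2 \bigm| X\bigr) + \frac{1}{2} \sum_{i=1}^n \mathbb{E}(Y_i' - \mu_i)^2.$$

Since $|Y_i - \mu_i| \le c_i$, the right hand side is at most $\sum_{i=1}^n c_i^2$, and so our theorem 
gives the usual Hoeffding bound (Theorem 2.1 in Chapter 2) in this case. 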
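
For a quick numerical sanity check of this example, take the $Y_i$ to be independent uniform random signs, so that $\mu_i = 0$, $c_i = 1$ and one may take $C = n$ in Theorem 3.3. The Python sketch below compares the exact two-sided tail of $X$ with the bound $2e^{-t^2/2n}$; the function names and the choice $n = 30$ are our own illustrative assumptions.

```python
import math


def exact_two_sided_tail(n, t):
    """P{|X| >= t} where X is a sum of n independent uniform +1/-1 signs."""
    count = sum(math.comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= t)
    return count / 2 ** n


def exchangeable_pair_bound(t, C):
    """The bound 2 exp(-t^2 / (2 C)) from Theorem 3.3."""
    return 2 * math.exp(-t * t / (2 * C))


n = 30
rows = [
    (t, exact_two_sided_tail(n, t), exchangeable_pair_bound(t, C=n))
    for t in range(0, n + 1, 2)
]
for t, exact, bound in rows:
    print(f"t = {t:2d}   exact = {exact:.3e}   bound = {bound:.3e}")
```

The bound is loose for small $t$ (where it exceeds 1) and tracks the exact tail on the logarithmic scale for larger $t$, as one expects from a sub-gaussian estimate.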

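Proof of Theorem 3.3. Let $m(\theta) := \mathbb{E}(e^{\theta f(X)})$ be the moment generating function 
of $f(X)$. We can differentiate $m(\theta)$ and move the derivative inside the expectation 
because of the integrability assumption (3.2). Thus, by Lemma 3.1 we have 

$$m'(\theta) = \mathbb{E}\bigl(e^{\theta f(X)} f(X)\bigr) = \tfrac{1}{2} \mathbb{E}\bigl((e^{\theta f(X)} - e^{\theta f(X')}) F(X, X')\bigr).$$

Now note that for any $x, y \in \mathbb{R}$, 

$$\Bigl| \frac{e^x - e^y}{x - y} \Bigr| = \int_0^1 e^{tx + (1-t)y} \, dt \le \int_0^1 \bigl(t e^x + (1 - t) e^y\bigr) \, dt = \frac{1}{2}(e^x + e^y), \qquad (3.3)$$

where the inequality holds by the convexity of $u \mapsto e^u$. 
Using this inequality, and the exchangeability of $X$ and $X'$, we get 

$$|m'(\theta)| \le \frac{|\theta|}{4} \mathbb{E}\bigl((e^{\theta f(X)} + e^{\theta f(X')}) \, |(f(X) - f(X')) F(X, X')|\bigr)$$
$$\le \frac{|\theta|}{2} \mathbb{E}\bigl(e^{\theta f(X)} v(X) + e^{\theta f(X')} v(X')\bigr)$$
$$= |\theta| \, \mathbb{E}\bigl(e^{\theta f(X)} v(X)\bigr) \le C |\theta| m(\theta).$$

For $\theta \ge 0$, this means $\frac{d}{d\theta} \log m(\theta) \le C\theta$. Since $m(0) = 1$, we have $\log m(\theta) \le C\theta^2/2$. 
The same bound for $\theta < 0$ follows by applying the argument to $-f$ and $-F$, which leaves 
$v$ unchanged. 

Given $t \ge 0$, we can choose $\theta = t/C$ and get 

$$\mathbb{P}\{f(X) \ge t\} \le e^{-\theta t} m(\theta) \le e^{-\theta t + C\theta^2/2} = e^{-t^2/2C}.$$

Similarly, $\mathbb{P}\{f(X) \le -t\} \le e^{-t^2/2C}$. □ 



3.3 Example: Curie-Weiss model 

In this section, we shall work out concentration bounds for the magnetization in 
the Curie-Weiss model of ferromagnetic interaction. This is arguably the simplest 
statistical mechanical model of spin systems. For a detailed mathematical treatment 
of this model, we refer to the book [41] by Richard Ellis. In the next section, we shall 
extend our technique for the Curie-Weiss model to a more general class of models 
with quadratic interaction. 

The state space is $\{-1, 1\}^n$, the space of all possible spin configurations of $n$ 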
particles. A typical configuration will be denoted by $\sigma = (\sigma_1, \ldots, \sigma_n)$. The 
Curie-Weiss model at inverse temperature $\beta$ and a fixed external magnetic field $h$ 
prescribes a joint probability density (the Gibbs measure) for these spins by the formula 
$P_{\beta,h}(\sigma) = Z_{\beta,h}^{-1} e^{-\beta H(\sigma)}$, where $Z_{\beta,h}$ is the normalizing constant, and 

$$H(\sigma) := -\frac{1}{n} \sum_{1 \le i < j \le n} \sigma_i \sigma_j - h \sum_{i=1}^n \sigma_i$$

is the Hamiltonian for the system. We shall henceforth assume that $\beta$ and $h$ are fixed 
and omit the subscripts. One quantity of interest is the magnetization, defined as 

$$m(\sigma) := \frac{1}{n} \sum_{i=1}^n \sigma_i.$$

It is common knowledge in the physics circles (and proved rigorously in ^T]) that 
if $n$ is large and $\sigma$ is drawn from the Gibbs measure, the random variable $m(\sigma)$ is 
concentrated in a neighborhood of the set of solutions of the equation 

$$x = \tanh(\beta x + \beta h).$$

The equation has a unique root for small values of $\beta$ (the "high temperature phase") 
and multiple solutions for $\beta$ above a critical value (the "low temperature phase"). 
For example, when $h = 0$, $\beta_c = 1$ is the critical value. 

In this context, we can use Theorems 3.2 and 3.3 quite easily to prove the following 
result: 

Proposition 3.4 The magnetization $m$ in the Curie-Weiss model satisfies 

$$\mathbb{E}\bigl(m - \tanh(\beta m + \beta h)\bigr) \le$$

Moreover, we also have the concentration bound 

$$\mathbb{P}\Bigl\{ \bigl| m - \tanh(\beta m + \beta h) \bigr| \ge \frac{\beta}{n} + t \Bigr\} \le 2e^{-nt^2/(4 + 4\beta)}.$$

Here $\mathbb{E}$ and $\mathbb{P}$ denote expectation and probability under the Gibbs measure at inverse 
temperature $\beta$ and external field $h$. 

Remark. Although the Curie- Weiss model is considered to be the simplest model of 
ferromagnetic interaction, we haven't encountered any result in the literature which 
gives bounds like the above. In fact, it is unlikely that there exists a short and simple 
bare-hands argument which gives such bounds. 

The classical way to solve the Curie- Weiss model uses ideas from large deviations 
theory, and involves a variational problem in the limit (see Ellis [H], Chapter IV). 
The magnetization is obtained by the usual technique of differentiating the free energy 
function. Although the limiting result is straightforward, getting finite sample tail 
bounds will not be easy along this route. 

Proof. Suppose a is drawn from the Gibbs distribution. We construct a' by taking a 
step in the Gibbs sampler as follows: Choose a coordinate / uniformly at random, and 
replace the I^^ coordinate of a by an element drawn from the conditional distribution 
of the P^ coordinate given the rest. It is well-known and easy to prove that (cr, a') is 
an exchangeable pair. Let 

n 

F{a,a') :=5^(a,-aD. 

i=l 

Now define 

mi{a) := - a,-, i=l,...,n. 

Since the Hamiltonian is a simple explicit function, the conditional distribution of 
the z*^ coordinate given the rest is easy to obtain. An easy computation gives 
IE(o'j|{o"j, j 7^ i}) = tanh(/5mj + ph). Thus, we have 

1 " 

f{a) = E{F{a,a')\a) = - - E{a,\{a„j ^ i})) 

1 " 

= m tanh(/?mj + (3h). 

^ i=l 
Now note that $|F(\sigma,\sigma')| \le 2$, because $\sigma$ and $\sigma'$ differ at only one coordinate. Also, since the map $x \mapsto \tanh x$ is 1-Lipschitz, we have

$$|f(\sigma) - f(\sigma')| \le |m(\sigma) - m(\sigma')| + \frac{\beta}{n}\sum_{i=1}^n |m_i(\sigma) - m_i(\sigma')| \le \frac{2(1+\beta)}{n}.$$

Thus, by Theorem 3.2 we have

$$\mathrm{Var}\Big(m - \frac{1}{n}\sum_{i=1}^n \tanh(\beta m_i + \beta h)\Big) \le \frac{2(1+\beta)}{n},$$

and from Theorem 3.3,

$$\mathbb{P}\Big\{\Big|m - \frac{1}{n}\sum_{i=1}^n \tanh(\beta m_i + \beta h)\Big| \ge t\Big\} \le 2e^{-nt^2/(4+4\beta)}.$$

Finally note that for each $i$, by the Lipschitz nature of the tanh function, we get

$$|\tanh(\beta m_i + \beta h) - \tanh(\beta m + \beta h)| \le \beta|m_i - m| \le \frac{\beta}{n}.$$

This completes the proof of the theorem. □
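The two displays at the end of the proof combine to give $\mathbb{P}\{|m - \tanh(\beta m + \beta h)| \ge \beta/n + s\} \le 2e^{-ns^2/(4+4\beta)}$. The following sketch checks this numerically by exact summation over the total spin. It assumes that the Gibbs weights are proportional to $\exp(\beta(\frac{1}{n}\sum_{i<j}\sigma_i\sigma_j + h\sum_i\sigma_i))$, which is the normalization consistent with the conditional expectation $\tanh(\beta m_i + \beta h)$ used above; the function names are illustrative only.

```python
import math

def curie_weiss_tail(n, beta, h, s):
    """Exact P{|m - tanh(beta*m + beta*h)| >= beta/n + s} under the assumed
    Gibbs weights exp(beta * (sum_{i<j} s_i s_j / n + h * sum_i s_i))."""
    log_w = []
    for k in range(n + 1):                 # k = number of +1 spins
        total = 2 * k - n                  # total spin
        # sum_{i<j} s_i s_j = (total^2 - n) / 2
        log_w.append(math.log(math.comb(n, k))
                     + beta * ((total * total - n) / (2 * n) + h * total))
    top = max(log_w)                       # subtract the maximum for stability
    w = [math.exp(x - top) for x in log_w]
    z = sum(w)
    threshold = beta / n + s
    prob = 0.0
    for k in range(n + 1):
        m = (2 * k - n) / n
        if abs(m - math.tanh(beta * m + beta * h)) >= threshold:
            prob += w[k] / z
    return prob

def stein_bound(n, beta, s):
    """Right-hand side 2 exp(-n s^2 / (4 + 4 beta))."""
    return 2 * math.exp(-n * s * s / (4 + 4 * beta))
```

For instance, `curie_weiss_tail(100, 1.0, 0.0, 0.3)` can be compared against `stein_bound(100, 1.0, 0.3)`; the bound is not expected to be tight, only valid.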



3.4 Validity of naive mean field equations 

The proof of Proposition 3.4 shows that there is no reason why the argument should not generalize to the Hamiltonian

$$H(\sigma) := -\sum_{i<j} J_{ij}\sigma_i\sigma_j - \sum_{i=1}^n h_i\sigma_i, \tag{3.4}$$

where $J = (J_{ij})_{1 \le i,j \le n}$ is a symmetric interaction matrix with zeros on the diagonal, and $h_1,\dots,h_n$ are fixed real numbers. In this section, we are going to carry through an extension of our treatment of the Curie-Weiss model to find conditions under which a certain widely used (but mathematically dubitable) technique from physics, the naive mean field equations, is valid.

Throughout this section, we shall use the standard physical notation $\langle\cdot\rangle$ to denote the expectation with respect to the Gibbs measure.

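Now, it is an easy fact that $\mathbb{E}(\sigma_i \mid \{\sigma_j, j \neq i\}) = \tanh\big(\beta\sum_{j=1}^n J_{ij}\sigma_j + \beta h_i\big)$. This gives what are called the Callen equations (see e.g. Chapter 3 of [82]):

$$\langle\sigma_i\rangle = \Big\langle \tanh\Big(\beta\sum_{j=1}^n J_{ij}\sigma_j + \beta h_i\Big)\Big\rangle, \quad i = 1,\dots,n. \tag{3.5}$$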



These equations are, in a sense, just a restatement of the model, and not useful for anything deeper. The naive mean field equations are a modification of the Callen equations. The general physical intuition is that, when the spins are sufficiently "uncorrelated", the expectation on the right hand side of the Callen equations can be taken inside the tanh. This gives

$$\text{``}\langle\sigma_i\rangle = \tanh\Big(\beta\sum_{j=1}^n J_{ij}\langle\sigma_j\rangle + \beta h_i\Big), \quad i = 1,\dots,n\text{''}. \tag{3.6}$$



The reason why this is written within quotes is that no reasonable general condition is known for the validity of these equations. Physicists know that these are not valid in models with short range interactions like the Ising model, but generally work when the interactions are of very long range (as in the Curie-Weiss model, for instance). Rigorous texts like [H] prefer to avoid talking about mean field equations. Still, this technique is used with reasonable measures of success in a variety of fields, including computer science, image processing, neural networks and other computational sciences. A system of $n$ equations may seem strange at first sight, but the fact is that they often reduce to a much smaller system in practice (for instance, in the Curie-Weiss model, they reduce to just one equation). For a recent survey of applications in areas outside of physics, one can look at |^. For the physical perspective, a classical text is Parisi's book [82].
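In practice the system $x_i = \tanh(\beta\sum_j J_{ij}x_j + \beta h_i)$ is usually solved numerically. The sketch below uses plain fixed-point iteration, which is a choice made here for illustration and not a method taken from the text; the function name and the stopping rule are assumptions. The iteration is only guaranteed to converge when the map is a contraction, for example at sufficiently high temperature.

```python
import math

def mean_field_solve(J, h, beta, iters=500, tol=1e-12):
    """Iterate x <- tanh(beta * (J x + h)) from x = 0 until successive
    iterates agree to within tol (or iters steps have been taken)."""
    n = len(h)
    x = [0.0] * n
    for _ in range(iters):
        new = [math.tanh(beta * (sum(J[i][j] * x[j] for j in range(n)) + h[i]))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, x)) < tol:
            return new
        x = new
    return x

# Curie-Weiss type couplings: J_ij = 1/n off the diagonal, constant field.
n = 6
J = [[0.0 if i == j else 1.0 / n for j in range(n)] for i in range(n)]
solution = mean_field_solve(J, [0.2] * n, beta=0.5)
```

With these symmetric couplings all coordinates of `solution` coincide, which illustrates how the $n$ equations collapse to a single one in the Curie-Weiss case.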

We shall now show that, although it is difficult to obtain general conditions under which the average spins $\{\langle\sigma_i\rangle,\ i = 1,\dots,n\}$ satisfy the mean field equations, the conditional averages of the spins, defined as

$$\langle\sigma_i\rangle^- := \mathbb{E}(\sigma_i \mid \{\sigma_j, j \neq i\}) = \tanh\Big(\beta\sum_{j} J_{ij}\sigma_j + \beta h_i\Big), \quad i = 1,\dots,n, \tag{3.7}$$

satisfy (3.6) with high probability whenever the Hilbert-Schmidt norm of $J$ (denoted by $\|J\|_{HS}$) is not too large. The precise condition is



$$\|J\|_{HS} \ll n^{1/3}. \tag{3.8}$$

We shall also show that the mean field equations (3.6) themselves are approximately valid at sufficiently high temperature in the class of models satisfying (3.8). The results are summarized in the following theorem:

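Theorem 3.5 Fix $\beta \ge 0$. Let $\rho = \|J\|_{HS} = \big(\sum_{i,j} J_{ij}^2\big)^{1/2}$. Let $\langle\sigma_i\rangle^-$, $i = 1,\dots,n$, be defined as in (3.7). For $1 \le i \le n$, define

$$\varepsilon_i^- := \langle\sigma_i\rangle^- - \tanh\Big(\beta\sum_{j=1}^n J_{ij}\langle\sigma_j\rangle^- + \beta h_i\Big), \quad\text{and}$$

$$\varepsilon_i := \langle\sigma_i\rangle - \tanh\Big(\beta\sum_{j=1}^n J_{ij}\langle\sigma_j\rangle + \beta h_i\Big).$$

Then we have the bound $\langle(\varepsilon_i^-)^2\rangle \le 2\beta^2(1+\beta\rho)\sum_{j=1}^n J_{ij}^2$ for each $i$, and hence

$$\frac{1}{n}\sum_{i=1}^n \langle(\varepsilon_i^-)^2\rangle \le \frac{2}{n}(1+\beta\rho)\beta^2\rho^2. \tag{3.9}$$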

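We also have the tail bounds

$$\mathbb{P}\{|\varepsilon_i^-| \ge \beta t\} \le 2\exp\Big(-\frac{t^2}{4(1+\beta\rho)\sum_{j=1}^n J_{ij}^2}\Big) \tag{3.10}$$

for $1 \le i \le n$. Moreover, if $\beta < 1/\rho$, we have

$$\frac{1}{n}\sum_{i=1}^n \varepsilon_i^2 \le \frac{2(1+\beta\rho)\beta^2\rho^2}{n(1-\beta\rho)^2}. \tag{3.11}$$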

Discussion and examples. Before we come to a critical discussion, let us, for a moment, gloat over the fact that it is difficult to imagine that such bounds (even the second moment bounds) can be easily obtained by bare-hands arguments, or by any available technique, for that matter!

The first part of the theorem says that under (3.8), the conditional averages $\{\langle\sigma_i\rangle^-,\ i = 1,\dots,n\}$, instead of the unconditional ones, satisfy the mean field equations (3.6). In other words, we have a set of modified mean field equations

$$\text{``}\langle\sigma_i\rangle^- = \tanh\Big(\beta\sum_{j} J_{ij}\langle\sigma_j\rangle^- + \beta h_i\Big), \quad i = 1,\dots,n\text{''}, \tag{3.12}$$

which are approximately valid with high probability. In particular, if each $\langle\sigma_i\rangle^-$ is concentrated around its mean, then we can recover the mean field equations from the information in (3.9). One condition under which this happens is the "high temperature condition" $\beta < 1/\rho$, which gives the second part of the theorem. But presumably the concentration of $\langle\sigma_i\rangle^-$ can happen under other conditions also.

The condition (3.8) can be interpreted as some sort of a condition of long range interaction, as will be made clear by the following example. Consider a graph $G = (V,E)$ on $V = \{1,\dots,n\}$ with constant degree $r$. The (normalized) Ising Hamiltonian on $\{-1,1\}^n$ is defined as

$$H(\sigma) := -\frac{1}{r}\sum_{(i,j)\in E}\sigma_i\sigma_j - h\sum_{i=1}^n \sigma_i,$$

where $h$ is the external field. Thus, $J_{ij} = 1/r$ if $(i,j) \in E$, and $J_{ij} = 0$ otherwise. Since the graph has constant degree $r$, a simple computation gives $\|J\|_{HS} = \sqrt{n/r}$. Thus, the modified mean field equations (3.12) hold whenever $r \gg n^{1/3}$, which is a long range condition.
For another example, consider the following generalization of the Curie-Weiss model: suppose each $J_{ij}$ is bounded in magnitude by $1/n$. Then $\|J\|_{HS} \le 1$, and therefore the modified mean field equations hold. Moreover, the second part of the theorem shows that in this class of models, if $\beta < 1$, then the mean field equations (3.6) are valid.
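For small $n$ the first bound of the theorem can be checked by brute force. The sketch below enumerates $\{-1,1\}^n$, assuming Gibbs weights proportional to $\exp(\beta(\sum_{i<j}J_{ij}\sigma_i\sigma_j + \sum_i h_i\sigma_i))$, and compares $\langle(\varepsilon_i^-)^2\rangle$, the second moment of the error in the modified equations, with $2\beta^2(1+\beta\rho)\sum_j J_{ij}^2$. The function name and the circulant test graph are illustrative choices, not part of the text.

```python
import itertools
import math

def conditional_error_moments(J, h, beta):
    """Return (lhs, rhs) with lhs[i] = <(eps_i^-)^2> computed by exact
    enumeration and rhs[i] = 2 beta^2 (1 + beta rho) sum_j J_ij^2."""
    n = len(h)
    rho = math.sqrt(sum(J[i][j] ** 2 for i in range(n) for j in range(n)))
    configs = list(itertools.product((-1, 1), repeat=n))

    def log_weight(s):
        pair = sum(J[i][j] * s[i] * s[j]
                   for i in range(n) for j in range(i + 1, n))
        return beta * (pair + sum(h[i] * s[i] for i in range(n)))

    weights = [math.exp(log_weight(s)) for s in configs]
    z = sum(weights)
    lhs = [0.0] * n
    for s, w in zip(configs, weights):
        # conditional averages tanh(beta * m_i) with m_i = sum_j J_ij s_j + h_i
        cond = [math.tanh(beta * (sum(J[i][j] * s[j] for j in range(n)) + h[i]))
                for i in range(n)]
        for i in range(n):
            eps = cond[i] - math.tanh(
                beta * (sum(J[i][j] * cond[j] for j in range(n)) + h[i]))
            lhs[i] += (w / z) * eps * eps
    rhs = [2 * beta ** 2 * (1 + beta * rho) * sum(J[i][j] ** 2 for j in range(n))
           for i in range(n)]
    return lhs, rhs

def circulant_couplings(n, r):
    """J_ij = 1/r on a circulant graph of even degree r, zero elsewhere."""
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for d in range(1, r // 2 + 1):
            J[i][(i + d) % n] = J[i][(i - d) % n] = 1.0 / r
    return J
```

A call such as `conditional_error_moments(circulant_couplings(8, 4), [0.1] * 8, 0.7)` returns the two sides for each site, so the inequality can be inspected coordinate by coordinate.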

Proof of Theorem 3.5. Throughout this proof, we shall use the fact that tanh is a Lipschitz function without explicit mention. Also, as a natural extension of the notation in the previous section, we shall write

$$m_i = m_i(\sigma) := \sum_{j=1}^n J_{ij}\sigma_j + h_i.$$

Note that $\langle\sigma_i\rangle^- = \tanh(\beta m_i)$.

Now fix $i$, $1 \le i \le n$. Construct $(\sigma,\sigma')$ as in the proof of Proposition 3.4 by choosing $\sigma$ from the Gibbs measure and taking a step in the Gibbs sampler to get $\sigma'$ as follows: choose a coordinate $I$ uniformly at random, and replace the $I$-th coordinate of $\sigma$ by a random sample from the conditional distribution of $\sigma_I$ given $\{\sigma_j, j \neq I\}$. Define $F(\sigma,\sigma') = n\sum_{j=1}^n J_{ij}(\sigma_j - \sigma_j') = nJ_{iI}(\sigma_I - \sigma_I')$. Then, as before,

$$f(\sigma) = \sum_{j=1}^n J_{ij}\big(\sigma_j - \tanh(\beta m_j)\big) = m_i - \sum_{j=1}^n J_{ij}\tanh(\beta m_j) - h_i. \tag{3.13}$$

Now, if $\sigma$ and $\sigma'$ differ at site $l$, then

$$|f(\sigma) - f(\sigma')| \le 2|J_{il}| + \Big|\sum_{j=1}^n J_{ij}\big(\tanh(\beta m_j(\sigma)) - \tanh(\beta m_j(\sigma'))\big)\Big| \le 2|J_{il}| + \sum_{j=1}^n |J_{ij}|\,|\beta(m_j(\sigma) - m_j(\sigma'))| = 2|J_{il}| + 2\beta\sum_{j=1}^n |J_{ij}J_{jl}|.$$
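Also, $|F(\sigma,\sigma')| \le 2n|J_{iI}|$. Thus, we have

$$v(\sigma) = \frac{1}{2}\mathbb{E}\big(|(f(\sigma) - f(\sigma'))F(\sigma,\sigma')| \,\big|\, \sigma\big) \le \frac{1}{2n}\sum_{k=1}^n \Big(2|J_{ik}| + 2\beta\sum_{j=1}^n |J_{ij}J_{jk}|\Big)\,2n|J_{ik}|$$

$$= 2\sum_{k=1}^n J_{ik}^2 + 2\beta\sum_{j,k}|J_{ij}J_{jk}J_{ik}| \le 2\sum_{k=1}^n J_{ik}^2 + 2\beta\Big(\sum_{j,k}J_{ij}^2J_{ik}^2\Big)^{1/2}\Big(\sum_{j,k}J_{jk}^2\Big)^{1/2} = 2(1+\beta\rho)\sum_{j=1}^n J_{ij}^2. \tag{3.14}$$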

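Thus, by Theorem 3.2 and (3.14), we get

$$\Big\langle\Big(m_i - \sum_{j=1}^n J_{ij}\tanh(\beta m_j) - h_i\Big)^2\Big\rangle = \langle f(\sigma)^2\rangle \le 2(1+\beta\rho)\sum_{j=1}^n J_{ij}^2. \tag{3.15}$$

To complete the proof of the first set of inequalities in Theorem 3.5, observe that

$$\langle(\varepsilon_i^-)^2\rangle \le \beta^2\Big\langle\Big(m_i - \sum_{j=1}^n J_{ij}\tanh(\beta m_j) - h_i\Big)^2\Big\rangle,$$

and combine with (3.15). The tail bound on $\varepsilon_i^-$ follows from the same analysis, using Theorem 3.3.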

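Next, for each $i$ define the function $g_i : \mathbb{R}^n \to \mathbb{R}$ as $g_i(x) = \sum_{j=1}^n J_{ij}\tanh(\beta x_j) + h_i$. Define $g : \mathbb{R}^n \to \mathbb{R}^n$ as $g(x) := (g_1(x),\dots,g_n(x))$. Then note that for any $x, y \in \mathbb{R}^n$,

$$\|g(x) - g(y)\|^2 = \sum_{i=1}^n\Big(\sum_{j=1}^n J_{ij}\big(\tanh(\beta x_j) - \tanh(\beta y_j)\big)\Big)^2 \le \sum_{i=1}^n\Big(\sum_{j=1}^n J_{ij}^2\Big)\Big(\sum_{j=1}^n \beta^2(x_j - y_j)^2\Big) = \beta^2\rho^2\|x - y\|^2.$$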
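Thus, if $\beta < 1/\rho$, then $g$ is a contraction map with respect to the Euclidean metric. Therefore by the well-known theorem about contraction maps (see, e.g. Theorem 9.23 in Rudin [HI]), $g$ has a unique fixed point, which we shall call $x^*$. Now note that for any $x \in \mathbb{R}^n$,

$$\|x - x^*\| \le \|(x - g(x)) - (x^* - g(x^*))\| + \|g(x) - g(x^*)\| \le \|x - g(x)\| + \beta\rho\|x - x^*\|.$$

Thus for any $x$, we have $\|x - x^*\| \le (1-\beta\rho)^{-1}\|x - g(x)\|$. Applying this to $x = (m_1,\dots,m_n)$, and remembering that $\langle(m_i - a)^2\rangle$ is minimized at $a = \langle m_i\rangle$, we get

$$\sum_{i=1}^n \langle(m_i - \langle m_i\rangle)^2\rangle \le \sum_{i=1}^n \langle(m_i - x_i^*)^2\rangle \le \frac{1}{(1-\beta\rho)^2}\sum_{i=1}^n\Big\langle\Big(m_i - \sum_{j=1}^n J_{ij}\tanh(\beta m_j) - h_i\Big)^2\Big\rangle \le \frac{2(1+\beta\rho)\rho^2}{(1-\beta\rho)^2},$$

where the last step follows from inequality (3.15). Finally, observe that

$$\sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n \big(\langle\tanh(\beta m_i)\rangle - \tanh(\beta\langle m_i\rangle)\big)^2 \le \beta^2\sum_{i=1}^n \langle(m_i - \langle m_i\rangle)^2\rangle.$$

Combined with the previous step, this completes the proof. □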



3.5 Example: Least squares for the Ising model 

Consider an undirected graph $G = (V,E)$, where $V = \{1,\dots,n\}$. Let $r$ be the maximum degree of $G$. Let $X = (X_1,\dots,X_n)$ be a random element of $\{-1,1\}^n$. The Ising model assigns a probability distribution for $X$ according to the formula

$$\mathbb{P}_\theta\{X = x\} = Z(\theta)^{-1}\exp\Big(\theta\sum_{(i,j)\in E} x_ix_j\Big), \tag{3.16}$$

where $\theta$ is an unknown parameter (the "inverse temperature", usually denoted by $\beta$ in the physics literature), and $Z(\theta)$ is the normalizing constant. The parameter space is $\Omega := [0,\infty)$. The natural statistical problem in this model is the following: how to make inference about $\theta$ from a single realization of $X$? This is one of the most elementary (yet analytically almost intractable) models of dependent discrete data.

The classical maximum likelihood approach for this problem has been discussed 
in detail by Pickard JHB^. The main difficulty with directly computing the maximum 
likelihood estimator (MLE) is that the normalizing constant has no explicit form 
except in very special cases, and there is no polynomial time algorithm for exact 
numerical evaluation. 

In a well-cited paper, Geyer & Thompson ^B] devised a feasible Monte Carlo technique for computing the MLE in models like the above. One of the examples considered in that paper involves an instance of the autologistic model of Besag [12], which is a generalization of the Ising model. The autologistic model of binary data assumes that the conditional distribution of each $X_i$ given the rest can be modeled as a logistic regression, that is,



where the $\alpha_i$'s and the $\beta_{ij}$'s are known functions of a collection of unknown parameters. A simple verification shows that the Ising model described by (3.16) is a special case of the autologistic model.

Monte Carlo algorithms for computing the MLE in such problems are widely used nowadays, but lingering doubts remain about the rates of convergence, especially in models with high degrees of dependence like the ones described above.

A method which does not require simulations, but fell out of favor due to efficiency issues after the advent of Monte Carlo techniques, is the maximum pseudolikelihood approach introduced by Besag [12]. Besag's pseudolikelihood function is defined as

$$f(\theta|X) := \prod_{i=1}^n f_i(\theta|X), \tag{3.17}$$

where $f_i(\theta|X)$ is the conditional density of $X_i$ given $(X_j, j \neq i)$ under $\theta$. Now, if $X$ is a gaussian vector such that $\mathrm{Var}_\theta(X_i \mid (X_j, j \neq i))$ is a constant independent of $i$ and $\theta$, the pseudolikelihood problem reduces to minimizing

$$S(\theta) := \frac{1}{n}\sum_{i=1}^n \big(X_i - \mathbb{E}_\theta(X_i \mid (X_j, j \neq i))\big)^2,$$



which may be called a "conditional least squares problem". We shall now show that, somewhat surprisingly, minimizing $S(\theta)$ to estimate $\theta$ may be a good idea even outside the gaussian framework. In particular, we shall show by way of example later in this section that it works in the Ising model, and the method can even be extended to construct nonasymptotic confidence intervals for $\theta$. One of the interesting consequences is the following:

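Proposition 3.6 In the Ising model on a graph with maximum degree $r$ as defined in (3.16), we have the bound

$$\mathbb{E}_\theta\Big(S(\theta) - \inf_{\psi \ge 0} S(\psi)\Big) \le C\sqrt{\frac{r\log n}{n}} \quad\text{for every } \theta \ge 0,$$

where $C$ is a numerical constant.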

This shows that the conditional least squares inference for $\theta$ should work in this model when $r \ll n/\log n$. However, there is a caveat: although our results will show that $S(\theta)$ is minimized near the true value of $\theta$ under pretty general conditions, they say nothing about the sharpness of the minimum of $S(\theta)$. If $S$ is too flat, our results will be of no use.

In the remainder of this section, we shall adopt the notation $x^i$ for the vector $(x_1,\dots,x_{i-1},x_{i+1},\dots,x_n)$, where $x$ is the original vector $(x_1,\dots,x_n)$. Let $X = (X_1,\dots,X_n)$ be a random vector in $[-1,1]^n$, whose distribution is parametrized by a parameter $\theta$ belonging to some parameter space $\Omega$. For $1 \le i \le n$, let

$$\mu_i(\theta, x^i) := \mathbb{E}_\theta(X_i \mid X^i = x^i).$$

For each $\theta$, let $(a_{ij}(\theta))_{1 \le i,j \le n}$ be an array of nonnegative real numbers with zeros on the diagonal such that for any $x$, $y$, $i$ and $\theta$ we have

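$$|\mu_i(\theta, x^i) - \mu_i(\theta, y^i)| \le \sum_{j=1}^n a_{ij}(\theta)\,\mathbb{I}\{x_j \neq y_j\},$$

and let

$$M := \frac{1}{n}\sup_\theta \sum_{i,j} a_{ij}(\theta). \tag{3.18}$$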

Lemma 3.7 below will show that when $M \ll n$, which holds in many models (including Ising type models), $S(\psi) - S(\theta)$ is "approximately nonnegative with high probability" whenever $\theta$ is the true value of the parameter and $\psi$ is any other value. This will be shown by decomposing $S(\psi) - S(\theta)$ as $A(\psi,\theta) + B(\psi,\theta)$, where $A \ge 0$ and $B \approx 0$ under $\mathbb{P}_\theta$. The explicit expressions for $A$ and $B$ are as follows:

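$$A(\psi,\theta) = \frac{1}{n}\sum_{i=1}^n \big(\mu_i(\theta, X^i) - \mu_i(\psi, X^i)\big)^2$$

and

$$B(\psi,\theta) = \frac{2}{n}\sum_{i=1}^n \big(\mu_i(\theta, X^i) - \mu_i(\psi, X^i)\big)\big(X_i - \mu_i(\theta, X^i)\big).$$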

It is easy to check that indeed $S(\psi) - S(\theta) = A(\psi,\theta) + B(\psi,\theta)$. It is also apparent that $A(\psi,\theta) \ge 0$. The following lemma, the proof of which depends heavily on our techniques, shows that $B(\psi,\theta) \approx 0$ with high probability under $\theta$ whenever $M \ll n$. It is not completely honest, though, because the constants are bad.
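The decomposition can be checked numerically on a small graph. The sketch below assumes the Ising model $\mathbb{P}_\theta(x) \propto \exp(\theta\sum_{(i,j)\in E}x_ix_j)$, for which the conditional mean of $X_i$ given the other coordinates is $\tanh(\theta\sum_{j\in N(i)}x_j)$, and takes $S(\theta)$ to be the average squared residual $\frac{1}{n}\sum_i (X_i - \mu_i(\theta, X^i))^2$. It enumerates all configurations, verifies $S(\psi) - S(\theta) = A + B$ pointwise, and computes $\mathbb{E}_\theta B$, which should vanish because each residual $X_i - \mu_i(\theta, X^i)$ has conditional mean zero. All names are illustrative.

```python
import itertools
import math

def ising_decomposition(edges, n, theta, psi):
    """Return (max over x of |S(psi) - S(theta) - A - B|, E_theta[B]) by
    exact enumeration of {-1, 1}^n."""
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def mu(par, x, i):
        # conditional mean of X_i given the rest, at parameter value par
        return math.tanh(par * sum(x[j] for j in nbrs[i]))

    def S(par, x):
        return sum((x[i] - mu(par, x, i)) ** 2 for i in range(n)) / n

    worst, weighted_b, z = 0.0, 0.0, 0.0
    for x in itertools.product((-1, 1), repeat=n):
        w = math.exp(theta * sum(x[i] * x[j] for i, j in edges))
        diffs = [mu(theta, x, i) - mu(psi, x, i) for i in range(n)]
        A = sum(d * d for d in diffs) / n
        B = 2 * sum(diffs[i] * (x[i] - mu(theta, x, i)) for i in range(n)) / n
        worst = max(worst, abs(S(psi, x) - S(theta, x) - A - B))
        weighted_b += w * B
        z += w
    return worst, weighted_b / z

cycle = [(i, (i + 1) % 8) for i in range(8)]
gap, mean_b = ising_decomposition(cycle, 8, theta=0.4, psi=1.1)
```

Both returned numbers should be zero up to floating point error; the lemma below concerns the fluctuations of $B$, not its mean.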

Lemma 3.7 Suppose $X$ takes values in $[-1,1]^n$ and let $M$, $A$ and $B$ be defined as above. Then for any $\psi, \theta \in \Omega$ and $t \ge 0$, we have
Proof. Fix $\psi, \theta \in \Omega$. We produce $X'$, as usual, by taking a step in the Gibbs sampler: a coordinate $I$ is chosen uniformly at random, and $X_I$ is replaced by $X_I'$ drawn from the conditional distribution (under $\mathbb{P}_\theta$) of the $I$-th coordinate given $X^I$. Next, for each $i$, let

$$g_i(x^i) := \mu_i(\theta, x^i) - \mu_i(\psi, x^i),$$

and define

$$F(X,X') := \sum_{i=1}^n \big(g_i(X^i) + g_i(X'^i)\big)(X_i - X_i').$$

Clearly, $F$ is antisymmetric. Note that

$$f(X) := \mathbb{E}_\theta(F(X,X') \mid X) = 2\,\mathbb{E}_\theta\big(g_I(X^I)(X_I - X_I') \mid X\big) = \frac{2}{n}\sum_{i=1}^n g_i(X^i)\big(X_i - \mu_i(\theta, X^i)\big).$$

Thus, in our notation, $f(X) = B(\psi,\theta)$. From the given conditions it easily follows that if $x$ and $y$ are elements of $[-1,1]^n$ which differ only in the $i$-th coordinate, then



$$|f(x) - f(y)| \le \frac{4}{n}\Big(\sum_{j=1}^n \big(a_{ji}(\psi) + 2a_{ji}(\theta)\big) + 1\Big).$$



Also, quite clearly, $|F(X,X')| \le 8$. Combining, we have

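$$v(X) = \frac{1}{2}\mathbb{E}_\theta\big(|(f(X) - f(X'))F(X,X')| \,\big|\, X\big) \le \frac{16}{n^2}\sum_{i,j}\big(a_{ij}(\psi) + 2a_{ij}(\theta)\big) + \frac{16}{n} \le \frac{48M + 16}{n}.$$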



Invoking Theorem 3.3 completes the proof. □



The above lemma is only a moral justification for using the "conditional least squares approach" for estimating parameters in models of dependent data. It does not show that the true value of $\theta$ is an approximate global minimizer of $S(\theta)$. For that, and also for constructing confidence regions, we need tail bounds on $S(\theta) - \inf_\psi S(\psi)$. For instance, a $(1-\alpha)$-level confidence region for the true value of $\theta$ can be defined as $\{\theta : S(\theta) - \inf_{\psi\in\Omega} S(\psi) \le t_\alpha\}$, where $t_\alpha$ is chosen such that for any $\theta$,

$$\mathbb{P}_\theta\Big\{S(\theta) - \inf_\psi S(\psi) \ge t_\alpha\Big\} \le \alpha.$$

For doing all that, we need to introduce some more notation. Define a pseudometric $d$ on $\Omega$ as follows:

$$d(\theta,\theta') := \sup_x \frac{1}{n}\sum_{i=1}^n |\mu_i(\theta, x^i) - \mu_i(\theta', x^i)|.$$

For every $\varepsilon > 0$, let $N_d(\varepsilon)$ denote the minimum number of closed $\varepsilon$-balls (w.r.t. the metric $d$) required to cover $\Omega$. We have the following result:

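Theorem 3.8 Suppose $X$ takes values in $[-1,1]^n$. Let $S(\theta)$ and $M$ be defined as above, and let $N_d$ be the covering number defined above. Then, for any $\varepsilon > 0$ and any $\theta \in \Omega$, we have

$$\mathbb{P}_\theta\Big\{S(\theta) - \inf_\psi S(\psi) \ge 4\varepsilon + t\Big\} \le 2N_d(\varepsilon)e^{-nt^2/(96M+32)}$$

for all $t \ge 0$. Consequently, we also have

$$\mathbb{E}_\theta\Big(S(\theta) - \inf_{\psi\in\Omega} S(\psi)\Big) \le \inf_{\varepsilon > 0}\Big(4\varepsilon + 2\sqrt{\frac{(96M+32)\log(2N_d(\varepsilon))}{n}}\Big).$$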

Application to the Ising model. For the Ising model described at the beginning of the section, it is easy to verify that $\mu_i(\theta, x^i) = \tanh\big(\theta\sum_{j\in N(i)} x_j\big)$, where $N(i)$ is the neighborhood of $i$ in $G$. Now, since $\tanh x \in [-1,1]$ for all $x \in \mathbb{R}$, we have

$$|\mu_i(\theta, x^i) - \mu_i(\theta, y^i)| \le 2\sum_{j\in N(i)} \mathbb{I}\{x_j \neq y_j\}.$$

Thus, we can take $a_{ij}(\theta) = 2\,\mathbb{I}\{(i,j) \in E\}$ irrespective of $\theta$. With this choice of $a_{ij}$'s, we have $M = 2r$, where recall that $r$ is the maximum degree of $G$.
Now note that $\sum_{j\in N(i)} x_j \in \{-r,\dots,r\}$, and tanh is an odd function. Therefore for any $\theta$, $\theta'$ and $x$,

$$|\mu_i(\theta, x^i) - \mu_i(\theta', x^i)| \le \sup_{s\in\{0,1,\dots,r\}} |\tanh(\theta s) - \tanh(\theta' s)|.$$

Since tanh is a Lipschitz function, the right hand side is bounded by $|\theta - \theta'|r$. On the other hand, since tanh is an increasing function bounded by 1, the right hand side is also bounded by $1 - \tanh(\min\{\theta,\theta'\}) \le \exp(-\min\{\theta,\theta'\})$. Combining, we have

$$d(\theta,\theta') = \sup_x \frac{1}{n}\sum_{i=1}^n |\mu_i(\theta, x^i) - \mu_i(\theta', x^i)| \le \min\big\{|\theta - \theta'|r,\ e^{-\min\{\theta,\theta'\}}\big\}.$$



Now fix any $\varepsilon \in (0,1)$. Let $L = -\log\varepsilon$. Equipartition the interval $[0,L]$ into $\lfloor Lr/2\varepsilon\rfloor + 1$ subintervals of length $\le 2\varepsilon/r$ each. The above bound shows that the end points of these subintervals, together with $L$, form an $\varepsilon$-net with respect to the pseudometric $d$ on $[0,\infty)$. Thus,

$$N_d(\varepsilon) \le \frac{Lr}{2\varepsilon} + 2.$$
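The net construction is easy to make concrete. The sketch below builds the grid on $[0, L]$ and checks coverage using only the upper bound $\min\{|\theta-\theta'|r,\ e^{-\min\{\theta,\theta'\}}\}$ on $d$ derived above (so it certifies an $\varepsilon$-net for $d$ without computing $d$ itself). The function names are illustrative.

```python
import math

def ising_eps_net(r, eps):
    """End points of an equipartition of [0, L], L = -log(eps), into
    floor(L*r/(2*eps)) + 1 subintervals (each of length < 2*eps/r)."""
    L = -math.log(eps)
    k = math.floor(L * r / (2 * eps)) + 1      # number of subintervals
    return [L * i / k for i in range(k + 1)]

def d_upper(theta1, theta2, r):
    """Upper bound min{|t1 - t2| * r, exp(-min(t1, t2))} on d(t1, t2)."""
    return min(abs(theta1 - theta2) * r, math.exp(-min(theta1, theta2)))

def covered(theta, net, r, eps):
    """True if some net point is within eps of theta, judged via d_upper."""
    return min(d_upper(theta, p, r) for p in net) <= eps + 1e-12

net = ising_eps_net(r=3, eps=0.1)
```

Points beyond $L$ are covered by the last grid point through the exponential term, while points inside $[0, L]$ are covered through the Lipschitz term.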

It is now easy to apply Theorem 3.8. The second bound directs us to the optimal choice of $\varepsilon$, which we take to be $\varepsilon = \sqrt{r/n}$. Theorem 3.8 now gives

$$\mathbb{P}_\theta\Big\{S(\theta) - \inf_\psi S(\psi) \ge 4\sqrt{\frac{r}{n}} + t\Big\} \le 2\sqrt{rn}\,(\log n)\,e^{-nt^2/(96r+32)}$$

and

$$\mathbb{E}_\theta\Big(S(\theta) - \inf_\psi S(\psi)\Big) \le C\sqrt{\frac{r\log n}{n}},$$

where $C$ is a computable numerical constant. This proves Proposition 3.6.



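Proof of Theorem 3.8. Fix $\varepsilon > 0$. Let $k = N_d(\varepsilon)$, and let $\theta_1,\dots,\theta_k$ be the centers of a collection of $\varepsilon$-balls which cover $\Omega$. Then for every $\theta$ there exists $i$ such that $|S(\theta) - S(\theta_i)| \le 4d(\theta,\theta_i) \le 4\varepsilon$. Thus,

$$\min_{1\le i\le k} S(\theta_i) - \inf_\theta S(\theta) \le 4\varepsilon.$$

Therefore,

$$\mathbb{P}_\theta\Big\{S(\theta) - \inf_\psi S(\psi) \ge 4\varepsilon + t\Big\} \le \mathbb{P}_\theta\Big\{S(\theta) - \min_{1\le i\le k} S(\theta_i) \ge t\Big\} \le \sum_{i=1}^k \mathbb{P}_\theta\{S(\theta) - S(\theta_i) \ge t\}.$$

The first bound follows from this and Lemma 3.7. Now note that for any $b > 0$ and $a \ge \sqrt{e}$, we have

$$\int_0^\infty (ae^{-bt^2} \wedge 1)\,dt = \sqrt{b^{-1}\log a} + \int_{\sqrt{b^{-1}\log a}}^\infty ae^{-bt^2}\,dt \le \sqrt{b^{-1}\log a} + \int_{\sqrt{b^{-1}\log a}}^\infty \frac{t}{\sqrt{b^{-1}\log a}}\,ae^{-bt^2}\,dt$$

$$= \sqrt{b^{-1}\log a} + \frac{1}{2b\sqrt{b^{-1}\log a}} \le 2\sqrt{b^{-1}\log a} \quad (\text{since } a \ge \sqrt{e} \Rightarrow 2\sqrt{\log a} \ge 1/\sqrt{\log a}).$$

Thus,

$$\mathbb{E}_\theta\Big(S(\theta) - \inf_\psi S(\psi)\Big) \le 4\varepsilon + \int_0^\infty \big(2N_d(\varepsilon)e^{-nt^2/(96M+32)} \wedge 1\big)\,dt \le 4\varepsilon + 2\sqrt{\frac{(96M+32)\log(2N_d(\varepsilon))}{n}}.$$

This completes the proof of Theorem 3.8. □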

3.6 An inequality for self-bounded functions 

Theorem 3.3 can be extended in several ways. The following extension is analogous to existing results for the so called "self-bounded" functions. The terminology was introduced by Boucheron, Lugosi & Massart in ^H]. Recall the notation of these authors that was described in section 2.1 of Chapter 2. They call a function "self-bounded" if $V_+$ and $V_-$ can be bounded by some function of $Z$, which is usually a linear function. Functions of independent random variables which satisfy the self bounding property appear in reasonable amounts in the literature; examples include suprema of nonnegative empirical processes and conditional Rademacher averages. Further details and references about concentration inequalities for self-bounded functions of independent random variables are available in [21, 20] and [92].

We shall say a function $f$ is self-bounded (w.r.t. $X$) if $v(X) \le Bf(X) + C$ for some constants $B$ and $C$, where $v$ is defined in (3.1).

In Section 3.7 we shall provide some applications of this result to obtain Bernstein type concentration for functionals related to random permutations (matching problem and Spearman's footrule).

Theorem 3.9 Continuing with the notation introduced at the beginning of the Chapter, suppose $B$, $C$ are finite positive constants such that $v(x) \le Bf(x) + C$ for each $x$. Then $\mathbb{P}\{|f(X)| \ge t\} \le 2e^{-t^2/(2C+2Bt)}$ for any $t \ge 0$.

Example. As usual, consider $X = \sum_{i=1}^n Y_i$, where the $Y_i$'s are now independent random variables taking values in $[0,1]$. Let $\mu_i = \mathbb{E}(Y_i)$. We shall use Theorem 3.9 to prove a version of Bernstein's inequality in this setup. As before, let $f(X) = X - \mathbb{E}(X)$. Let $X'$ be constructed as in the example following Theorem 3.2. From previous computation, we know that

$$v(X) = \frac{1}{2}\sum_{i=1}^n \big(\mathbb{E}(Y_i^2) - 2\mu_i\mathbb{E}(Y_i \mid X) + \mathbb{E}(Y_i^2 \mid X)\big).$$

Using $Y_i^2 \le Y_i$ (because $0 \le Y_i \le 1$), we get

$$v(X) \le \frac{1}{2}\sum_{i=1}^n \big(\mathbb{E}(Y_i) + \mathbb{E}(Y_i \mid X)\big) = \frac{1}{2}f(X) + \mathbb{E}(X).$$

Thus, taking $B = 1/2$ and $C = \mathbb{E}(X)$, we get

$$\mathbb{P}\{|X - \mathbb{E}(X)| \ge t\} \le 2e^{-t^2/(2\mathbb{E}(X)+t)}.$$

Note, for instance, if $\mu_i = 1/2$ for all $i$, then $\mathbb{E}(X) = n/2$, and this bound is essentially equivalent to the Hoeffding bound. The increase in efficiency is apparent only when the $\mu_i$'s are small.
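As a sanity check of the bound in the example, the sketch below takes the $Y_i$ to be i.i.d. Bernoulli($p$), a special case chosen here for convenience, so that $X$ is binomial and the two-sided tail can be summed exactly. The function names are illustrative.

```python
import math

def binomial_two_sided_tail(n, p, t):
    """Exact P{|X - n p| >= t} for X ~ Binomial(n, p)."""
    mean = n * p
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if abs(k - mean) >= t)

def self_bounded_bernstein(mean, t):
    """Bound 2 exp(-t^2 / (2 E(X) + t)) from the example."""
    return 2 * math.exp(-t * t / (2 * mean + t))

def hoeffding(n, t):
    """Classical Hoeffding bound 2 exp(-2 t^2 / n) for summands in [0, 1]."""
    return 2 * math.exp(-2 * t * t / n)
```

Comparing `self_bounded_bernstein(n * p, t)` with `hoeffding(n, t)` for small `p` shows the gain mentioned in the text: the first bound depends on $\mathbb{E}(X)$ rather than on $n$.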

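Proof of Theorem 3.9. Proceeding exactly as in the proof of Theorem 3.3 we get

$$|m'(\theta)| \le |\theta|\,\mathbb{E}(e^{\theta f(X)}v(X)) \le |\theta|\,\mathbb{E}\big(e^{\theta f(X)}(Bf(X) + C)\big) = B|\theta|m'(\theta) + C|\theta|m(\theta).$$

Since $m$ is a convex function and $m'(0) = \mathbb{E}(f(X)) = 0$, therefore $m'(\theta)$ always has the same sign as $\theta$. Thus, for $0 \le \theta < 1/B$, the above inequality translates into

$$\frac{d}{d\theta}\log m(\theta) \le \frac{C\theta}{1 - B\theta}.$$

Using this and recalling that $m(0) = 1$, we have

$$\log m(\theta) \le \int_0^\theta \frac{Cu}{1 - Bu}\,du \le \frac{C\theta^2}{2(1 - B\theta)}.$$

Putting $\theta = t/(C + Bt)$, we get

$$\mathbb{P}\{f(X) \ge t\} \le \exp(-\theta t + \log m(\theta)) \le e^{-t^2/(2C+2Bt)}.$$

The lower tail can be done similarly (though the argument is not exactly symmetric in this case). □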
3.7 Example: Matching problem and Spearman's footrule

Suppose $\pi$ is chosen uniformly at random from the set of all permutations of $\{1,\dots,n\}$. It is a classical probabilistic fact (the matching problem) that as $n \to \infty$, the distribution of the number of fixed points of $\pi$ converges weakly to the Poisson distribution with mean 1. Error bounds can also be obtained by various methods. The question of tail bounds for this random variable, with explicit constants, is a reasonable problem to look at. Although this question can presumably be tackled by Talagrand's technique ([107], section 5; also discussed in section |TH here) for random permutations, it is a good test case for our theory.

Another interesting problem related to random permutations is the behavior of $\sum_{i=1}^n |\pi(i) - i|$. This statistic, known as Spearman's footrule (see, e.g. [HZ]), arises in the nonparametric theory of statistics.

In the rest of this section, we shall work out concentration inequalities for a generalized version of these statistics, originating in an early work of Hoeffding [51].

Let $(a_{ij})$ be an $n \times n$ array of real numbers, assumed to be in $[0,1]$ for our purposes. Let $\pi$ be a random (uniform) permutation of $\{1,\dots,n\}$, and let $X = \sum_{i=1}^n a_{i\pi(i)}$. This class of random variables was first studied by Hoeffding [51], who proved that they are approximately normally distributed under certain conditions. The following proposition gives concentration inequalities in this setup:

Proposition 3.10 Let $(a_{ij})_{1\le i,j\le n}$ be a collection of numbers from $[0,1]$. Let $X = \sum_{i=1}^n a_{i\pi(i)}$, where $\pi$ is drawn from the uniform distribution over the set of all permutations of $\{1,\dots,n\}$. Then for any $t \ge 0$, $\mathbb{P}\{|X - \mathbb{E}(X)| \ge t\} \le 2e^{-t^2/(4\mathbb{E}(X)+2t)}$.

Remarks. Note that the bound does not explicitly depend on $n$. This is a consequence of the assumption that the $a_{ij}$'s are bounded. A version of this theorem sans the boundedness assumption can also be proved using our techniques, but then the bound will involve $n$. The classical concentration result of Maurey [7D], stated as Theorem 2.14 in the review section 2.6, cannot give a Bernstein type bound like the above. Talagrand's theorem ([107], Theorem 5.1) might, but it is not clear from the abstract form whether it really does. McDiarmid's corollary [Tz] of Talagrand's result certainly cannot give a result like the above.

Example 1. (Matching problem.) Taking $a_{ij} = \mathbb{I}\{i = j\}$, we get $X$ to be the number of fixed points of $\pi$. Since $\mathbb{E}(X) = 1$, we get the exponential tail bound $\mathbb{P}\{|X - 1| \ge t\} \le 2e^{-t^2/(4+2t)}$, which does not depend on $n$.
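For small $n$ the matching problem can be enumerated completely, which gives a direct check of the bound in Example 1. The following is a brute-force sketch (illustrative names; feasible only for small $n$, since it visits all $n!$ permutations).

```python
import itertools
import math

def fixed_point_tail(n, t):
    """Exact P{|X - 1| >= t}, where X is the number of fixed points of a
    uniformly random permutation of n items."""
    hits = 0
    total = 0
    for perm in itertools.permutations(range(n)):
        x = sum(1 for i, v in enumerate(perm) if i == v)
        if abs(x - 1) >= t:
            hits += 1
        total += 1
    return hits / total

def matching_bound(t):
    """Bound 2 exp(-t^2 / (4 + 2 t)) from Example 1."""
    return 2 * math.exp(-t * t / (4 + 2 * t))
```

Since the bound does not involve $n$, the same `matching_bound(t)` applies to every `fixed_point_tail(n, t)`.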

Example 2. (Spearman's footrule.) Sometimes, the boundedness assumption on the $a_{ij}$'s can be overcome by just dividing by the maximum. We give one such example now.

In nonparametric statistics, a standard measure of distance between two permutations $\pi$ and $\sigma$ is the Spearman's footrule, defined as

$$\rho(\pi,\sigma) := \sum_{i=1}^n |\pi(i) - \sigma(i)|.$$

A standard reference for the uses of Spearman's footrule in nonparametric statistics is the book [57] by Kendall and Gibbons. From a statistical point of view, it is of interest to know the distribution of the Spearman's footrule distance between the identity and a random uniform permutation. Diaconis and Graham [29] proved the following theorem:

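Theorem 3.11 [Diaconis and Graham [29]] Let $\rho(\pi,\sigma) = \sum_{i=1}^n |\pi(i) - \sigma(i)|$. If $\pi$ is chosen uniformly, then

$$\mathbb{E}(\rho) = \frac{1}{3}(n^2 - 1), \qquad \mathrm{Var}(\rho) = \frac{1}{45}(n+1)(2n^2+7), \quad\text{and}$$

$$\mathbb{P}\Big\{\frac{\rho - \mathbb{E}(\rho)}{\sqrt{\mathrm{Var}(\rho)}} \le t\Big\} \to \Phi(t)$$

as $n \to \infty$, for each $t \in \mathbb{R}$, where $\Phi$ denotes the standard normal distribution function.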

We shall use Proposition 3.10 to get finite sample tail bounds of the correct order. Without loss of generality, we can take $\sigma$ to be the identity permutation. Let

$$a_{ij} = \frac{|i - j|}{n}, \quad 1 \le i,j \le n.$$

Then $0 \le a_{ij} \le 1$. Let $X = \sum_{i=1}^n a_{i\pi(i)} = \rho/n$. Then by Proposition 3.10, $\mathbb{P}\{|X - \mathbb{E}(X)| \ge u\} \le 2e^{-u^2/(4\mathbb{E}(X)+2u)}$. Put $u = t\sqrt{\mathrm{Var}(X)}$, and observe that

$$\frac{X - \mathbb{E}(X)}{\sqrt{\mathrm{Var}(X)}} = \frac{\rho - \mathbb{E}(\rho)}{\sqrt{\mathrm{Var}(\rho)}}.$$

By the Diaconis-Graham computation, $\mathbb{E}(X) = (n^2-1)/3n$ and $\mathrm{Var}(X) = (n+1)(2+7n^{-2})/45$. Combining, we get the following tail bound:

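Proposition 3.12 Let $\rho$ be the Spearman's footrule distance between the identity and a uniformly chosen permutation of $\{1,\dots,n\}$. For any $t \ge 0$, we have

$$\mathbb{P}\Big\{\frac{|\rho - \mathbb{E}(\rho)|}{\sqrt{\mathrm{Var}(\rho)}} \ge t\Big\} \le 2e^{-t^2/(\alpha_n + \beta_n t)},$$

where

$$\alpha_n = \frac{60(n-1)}{n(2+7n^{-2})} \to 30 \text{ as } n \to \infty, \quad\text{and}$$

$$\beta_n = 2\sqrt{45}\,(n+1)^{-1/2}(2+7n^{-2})^{-1/2} \to 0 \text{ as } n \to \infty.$$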

Note that our tail bound is approximately a gaussian bound (and hence, of the correct order) for large $n$. However, the constants are poor. The reason is that Proposition 3.10 is designed for random variables that exhibit Poissonian behavior. Our $X$ in this problem does not belong to that class, particularly because its variance is smaller than its mean by a factor significantly smaller than unity. Thus, although Proposition 3.10 can be successfully applied to this problem to get concentration bounds of the correct order, it is not an ideal example.
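The moment formulas and the constants $\alpha_n$, $\beta_n$ can be checked by enumeration for small $n$. The sketch below (illustrative names, brute force over all permutations) computes the exact mean and variance of the footrule with rational arithmetic, and evaluates both sides of the tail bound.

```python
import itertools
import math
from fractions import Fraction

def footrule_values(n):
    """Footrule distance to the identity for every permutation of n items."""
    return [sum(abs(p[i] - i) for i in range(n))
            for p in itertools.permutations(range(n))]

def footrule_moments(n):
    """Exact mean and variance of the footrule under the uniform measure."""
    vals = footrule_values(n)
    mean = Fraction(sum(vals), len(vals))
    var = Fraction(sum(v * v for v in vals), len(vals)) - mean ** 2
    return mean, var

def alpha_beta(n):
    """Constants alpha_n and beta_n of the tail bound."""
    alpha = 60 * (n - 1) / (n * (2 + 7 / n ** 2))
    beta = 2 * math.sqrt(45) / math.sqrt((n + 1) * (2 + 7 / n ** 2))
    return alpha, beta

def tail_and_bound(n, t):
    """Exact P{|rho - E rho| >= t sd(rho)} and 2 exp(-t^2/(alpha_n + beta_n t))."""
    vals = footrule_values(n)
    mean, var = footrule_moments(n)
    sd = math.sqrt(var)
    tail = sum(1 for v in vals if abs(v - mean) >= t * sd) / len(vals)
    alpha, beta = alpha_beta(n)
    return tail, 2 * math.exp(-t * t / (alpha + beta * t))
```

For the small values of $n$ that can be enumerated the bound is far from sharp, in line with the remark that the constants are poor.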



Proof of Proposition 3.10. Construct $X'$ as follows: choose $I, J$ uniformly and independently at random from $\{1,\dots,n\}$. Let $\pi' = \pi\circ(I,J)$, where $(I,J)$ denotes the transposition of $I$ and $J$. It can be easily verified that $(\pi,\pi')$ is an exchangeable pair. Hence if we let

$$X' := \sum_{i=1}^n a_{i\pi'(i)},$$

then $(X,X')$ is also an exchangeable pair. Now note that

$$\frac{1}{2}\mathbb{E}(n(X - X') \mid \pi) = \frac{n}{2}\mathbb{E}\big(a_{I\pi(I)} + a_{J\pi(J)} - a_{I\pi(J)} - a_{J\pi(I)} \,\big|\, \pi\big) = X - \mathbb{E}(X).$$

Thus, we can take $f(x) = x - \mathbb{E}(X)$ and $F(x,y) = \frac{n}{2}(x - y)$. Now note that since $0 \le a_{ij} \le 1$ for all $i$ and $j$, we have

$$v(X) = \frac{n}{4}\mathbb{E}\big((X - X')^2 \mid \pi\big) \le \frac{1}{2n}\sum_{i,j}\big(a_{i\pi(i)} + a_{j\pi(j)} + a_{i\pi(j)} + a_{j\pi(i)}\big) = X + \mathbb{E}(X) = f(X) + 2\mathbb{E}(X).$$

Applying Theorem 3.9 with $B = 1$ and $C = 2\mathbb{E}(X)$ completes the proof. □
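The self-bounding inequality $v(X) \le X + \mathbb{E}(X)$ used in the proof can be verified exhaustively for a small array. The sketch below (illustrative names) computes $v$ exactly for every permutation by averaging over all $n^2$ choices of $(I,J)$, and reports the largest value of $v(X) - X - \mathbb{E}(X)$, which should never be positive.

```python
import itertools

def worst_self_bounding_gap(a):
    """Max over permutations pi of v(pi) - (X + E X), where
    v(pi) = (n/4) E((X - X')^2 | pi) and X' uses pi composed with a
    uniformly random transposition (I, J)."""
    n = len(a)
    mean_x = sum(sum(row) for row in a) / n          # E X = (1/n) sum_{i,j} a_ij
    worst = float("-inf")
    for pi in itertools.permutations(range(n)):
        x = sum(a[i][pi[i]] for i in range(n))
        second_moment = 0.0
        for i in range(n):
            for j in range(n):
                # X - X' when positions i and j are swapped
                diff = a[i][pi[i]] + a[j][pi[j]] - a[i][pi[j]] - a[j][pi[i]]
                second_moment += diff * diff
        v = (n / 4) * second_moment / n ** 2
        worst = max(worst, v - (x + mean_x))
    return worst

n = 5
footrule_array = [[abs(i - j) / n for j in range(n)] for i in range(n)]
matching_array = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
```

Calling `worst_self_bounding_gap` on either array returns a nonpositive number, matching the choice $B = 1$, $C = 2\mathbb{E}(X)$.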

3.8 Inequalities for unbounded differences 

In this section, we present some results which are applicable when $v(X)$ is unbounded. As usual, we shall supplement with trivial examples. A nontrivial application will be worked out in section 4.3 of the next chapter.

The first result of this section is an extension of Theorem 3.3 which only requires reasonable bounds on the moment generating function of $v(X)$, assuming that it exists. To be more specific, note that if $v(X)$ is bounded, then

$$\lim_{L\to\infty} L^{-1}\log\mathbb{E}(e^{Lv(X)}) = \|v(X)\|_\infty.$$
Thus, the number $r(L)$ defined as

$$r(L) := L^{-1}\log\mathbb{E}(e^{Lv(X)})$$

may serve as a surrogate for an actual bound on $v(X)$, in situations where such a bound does not exist, or is not representative of the true size of $v(X)$. The following theorem allows us the flexibility of using $r(L)$ with appropriate choice of $L$:

Theorem 3.13 Suppose $r(L)$ is defined as above. Fix any $L > 0$. Then we have $\mathbb{P}\{|f(X)| \ge t\} \le 2e^{-t^2/(2r(L)+4tL^{-1/2})}$ for any $t \ge 0$.

Remarks. The idea is to choose L so that ^ i^iL). In particular, observe that 

if $v(X) \le C$ almost surely for some constant $C$, we can take $L \to \infty$ and get the bound in Theorem 3.3.

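Example. Again, let $X = \sum_{i=1}^n Y_i$, where the $Y_i$'s are independent zero mean random variables. However, we shall now drop the boundedness assumption, and assume only that the $Y_i$'s have gaussian tails: in other words, assume that there exists $\theta > 0$ such that $\mathbb{E}(e^{\theta Y_i^2}) \le K_\theta < \infty$ for each $i$ for some fixed constant $K_\theta$. Choosing $L = 2\theta$ and applying Jensen's inequality, we have

$$r(L) \le (2\theta)^{-1}\log\mathbb{E}\big(e^{\theta\sum_{i=1}^n(\mathbb{E}(Y_i^2) + \mathbb{E}(Y_i^2 \mid X))}\big) \le (2\theta)^{-1}\sum_{i=1}^n 2\log\mathbb{E}(e^{\theta Y_i^2}) \le \theta^{-1}n\log K_\theta.$$

The above result now gives the bound

$$\mathbb{P}\{|X| \ge t\} \le 2\exp\Big(-\frac{t^2}{2\theta^{-1}n\log K_\theta + 4t(2\theta)^{-1/2}}\Big).$$

Note how this reduces to the Hoeffding bound when the $Y_i$'s are bounded.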



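Proof of Theorem 3.13. Let $u(X) = e^{\theta f(X)}/m(\theta)$, where $m(\theta) = \mathbb{E}(e^{\theta f(X)})$, as usual. Applying Jensen's inequality, we have for $\theta \ge 0$

$$m'(\theta) \le \theta\,\mathbb{E}(e^{\theta f(X)}v(X)) = L^{-1}\theta m(\theta)\,\mathbb{E}\Big(u(X)\Big[\log\frac{e^{Lv(X)}}{u(X)} + \log u(X)\Big]\Big) \le L^{-1}\theta m(\theta)\log\mathbb{E}(e^{Lv(X)}) + L^{-1}\theta\,\mathbb{E}(e^{\theta f(X)}\log u(X)).$$

Recall that $m'(0) = \mathbb{E}(f(X)) = 0$, $m(0) = 1$, and $m$ is a convex function. Hence $m(\theta) \ge 1$ for all $\theta$. Consequently, $\log u(X) \le \theta f(X)$. Using this in the above bound, we have

$$m'(\theta) \le r(L)\theta m(\theta) + L^{-1}\theta^2 m'(\theta).$$

Written differently, this gives

$$\frac{d}{d\theta}\log m(\theta) \le \frac{r(L)\theta}{1 - L^{-1}\theta^2}$$

for $0 \le \theta < L^{1/2}$. Now take any $t \ge 0$, and let $\theta$ be the positive root of the equation

$$\frac{r(L)\theta}{1 - L^{-1}\theta^2} = t.$$

Explicitly, we have

$$\theta = \frac{r(L)L}{2t}\Big(\sqrt{1 + \frac{4t^2}{r(L)^2L}} - 1\Big). \tag{3.19}$$

Using the inequality $\sqrt{1+a} \le 1 + \sqrt{a}$, it is easy to see that $\theta \le L^{1/2}$. Another application of the same inequality gives

$$\sqrt{1+a} - 1 - \frac{a}{2(1+\sqrt{a})} \ge \sqrt{1+a} - 1 - \frac{a}{2\sqrt{1+a}} \ge 0.$$

Using this inequality in (3.19), we get

$$\theta \ge \frac{t}{r(L) + 2tL^{-1/2}}.$$

Now, with the $\theta$ defined in (3.19) we have

$$\log m(\theta) \le \int_0^\theta \frac{r(L)u}{1 - L^{-1}u^2}\,du \le \frac{r(L)\theta^2}{2(1 - L^{-1}\theta^2)} = \frac{\theta t}{2}.$$

Thus, by Chebychev's inequality we get

$$\mathbb{P}\{f(X) \ge t\} \le e^{-\theta t + \log m(\theta)} \le e^{-\theta t/2}.$$

Using the lower bound on $\theta$ derived above, we get the desired expression. The lower tail bounds are obtained by symmetry. □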
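The algebra around (3.19) is easy to spot-check numerically. The sketch below (the function name is illustrative) evaluates the explicit root and lets one confirm the three facts used in the proof: it solves $r\theta/(1 - \theta^2/L) = t$, it lies below $L^{1/2}$, and it is at least $t/(r + 2tL^{-1/2})$.

```python
import math

def theta_root(r, L, t):
    """Positive root of r * theta / (1 - theta**2 / L) = t, as in (3.19)."""
    return (r * L / (2 * t)) * (math.sqrt(1 + 4 * t * t / (r * r * L)) - 1)

def residual(r, L, t):
    """Relative error of the root in the defining equation."""
    theta = theta_root(r, L, t)
    return abs(r * theta / (1 - theta * theta / L) - t) / t

def lower_bound(r, L, t):
    """The quantity t / (r + 2 t L^{-1/2}) that the root dominates."""
    return t / (r + 2 * t / math.sqrt(L))
```

For example, with `r = 2.0`, `L = 1.0`, `t = 1.0` the root is $\sqrt{2} - 1$, which lies between `lower_bound(2.0, 1.0, 1.0) = 0.25` and $L^{1/2} = 1$.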

Our next result is the exchangeable pairs version of the Burkholder-Davis-Gundy inequality fIT\ |221 from classical probability. In its simplest form, this inequality says that for a martingale difference sequence $(X_i)_{1\le i\le n}$ adapted to some filtration $(\mathcal{F}_i)_{1\le i\le n}$, we have for any $p \ge 1$ the inequality


|2p 

i=l i=l 



Note that the inequality is trivially an equality for $p = 1$. Essentially, this inequality gives a $p$-th moment expression of the notion that a martingale which is the sum of a homogeneous sequence of differences grows like $n^{1/2}$. (Analogously, the Hoeffding inequality gives a tail estimate expression of this notion when the differences are bounded.) This is clear from the observation that the right hand side above is bounded by $n^{p-1}\sum_{i=1}^n \mathbb{E}|X_i|^{2p} = O(n^p)$ by Jensen's inequality, and so if the $X_i$'s are of comparable sizes then $\mathbb{E}|\sum X_i|^{2p} = O(n^p)$, which "shows" that $\sum X_i = O(n^{1/2})$ up to $p$-th moment accuracy.
Recently, Boucheron, Bousquet, Lugosi & Massart [20] have derived a useful version of this inequality for general functions of independent random variables (rather than just sums). Moment bounds are often useful when we can take $p$ to be large (growing with $n$), because the resulting Chebychev inequalities give surprisingly efficient tail bounds. In fact, it is an easy fact that for any suitable random variable $X$ and any $t > 0$,

$$\mathbb{P}\{|X| \ge t\} \le \inf_{p \ge 1}\frac{\mathbb{E}|X|^p}{t^p} \le \inf_{\theta > 0}\frac{\mathbb{E}(e^{\theta|X|})}{e^{\theta t}},$$

which shows that optimized Chebychev bounds are better than optimized Chernoff bounds. For applications of the moment inequalities to obtain efficient tail bounds for a variety of complicated functions of independent random variables, we refer to [20].

We shall now present the exchangeable pairs version of the Burkholder-Davis-Gundy inequality. In the following, $\|\cdot\|_p$ will denote the $L^p$ norm of random variables; that is, for a random variable $Y$, $\|Y\|_p := (\mathbb{E}|Y|^p)^{1/p}$.

Theorem 3.14 For any positive integer $p$, we have

$$\|f(X)\|_{2p}^2 \le (2p-1)\|v(X)\|_p.$$

Example. To see why this can be called the Burkholder-Davis-Gundy inequality for
exchangeable pairs, consider, as usual, sums of independent random variables. Let
$Y_1, \ldots, Y_n$ be independent mean zero random variables with finite
$p^{\mathrm{th}}$ moment, where $p$ is a fixed positive integer. Let
$X = \sum_{i=1}^n Y_i$. Construct $X'$ as in the previous sections, and recall that
$f(X) = X$, $F(X,X') = n(X - X')$, and

$$v(X) = \frac{1}{2}\sum_{i=1}^n \big(\mathbb{E}(Y_i^2) + \mathbb{E}(Y_i^2 \mid X)\big).$$

By the above theorem and simple applications of Minkowski and Jensen inequalities,
we get

$$\|X\|_{2p}^2 \le (2p-1)\,\Big\|\sum_{i=1}^n Y_i^2\Big\|_p,$$

which is the classical inequality for sums of independent random variables.
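
As a sanity check of the last display, one can compute both sides exactly when the
$Y_i$ are i.i.d. symmetric signs. This is a sketch under our own choice of example:
here $\sum Y_i^2 = n$, so the inequality reads
$\mathbb{E}X^{2p} \le ((2p-1)n)^p$. The helper names are hypothetical.

```python
from math import comb

def rademacher_sum_moment(n, q):
    """Exact E[X^q] for X = Y_1 + ... + Y_n with i.i.d. signs Y_i = +/-1."""
    total = 0
    for plus in range(n + 1):
        value = 2 * plus - n              # value of X when `plus` of the signs are +1
        total += comb(n, plus) * value ** q
    return total / 2 ** n

def bdg_right_side(n, p):
    """(2p-1)^p * E[(sum Y_i^2)^p]; for signs, sum Y_i^2 equals n exactly."""
    return ((2 * p - 1) * n) ** p

for n in (5, 10, 20):
    for p in (1, 2, 3):
        print(n, p, rademacher_sum_moment(n, 2 * p), bdg_right_side(n, p))
```

For $p = 1$ the two columns coincide, matching the remark that the inequality is an
equality in that case.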

Proof of Theorem 3.14. By Lemma 3.1 we have

$$\mathbb{E}(f(X)^{2p}) = \frac{1}{2}\,\mathbb{E}\big((f(X)^{2p-1} - f(X')^{2p-1})F(X,X')\big).$$

By the inequality

$$|x^{2p-1} - y^{2p-1}| \le \frac{2p-1}{2}\,(x^{2p-2} + y^{2p-2})\,|x - y|,$$

which follows easily from a convexity argument very similar to (3.3), and the
exchangeability of $(X, X')$, we have

$$\mathbb{E}(f(X)^{2p}) \le (2p-1)\,\mathbb{E}\big(f(X)^{2p-2}v(X)\big).$$

By Hölder's inequality, we get

$$\mathbb{E}(f(X)^{2p}) \le (2p-1)\big(\mathbb{E}(f(X)^{2p})\big)^{(p-1)/p}\big(\mathbb{E}(v(X)^{p})\big)^{1/p}.$$

The proof is completed by transferring $\mathbb{E}(f(X)^{2p})^{(p-1)/p}$ to the other side. □

3.9 A refinement of the exponential inequality 

The following is a refinement of Theorem 3.3 which allows us, in particular, to derive
concentration bounds in the Sherrington-Kirkpatrick model of spin glasses in the next
section.

Theorem 3.15 Define

$$v_1(x) := \frac{1}{2}\,\mathbb{E}\big((f(X) - f(X'))F(X,X') \,\big|\, X = x\big) \quad\text{and}$$
$$v_2(x) := \frac{1}{4}\,\mathbb{E}\big((f(X) - f(X'))^2\,|F(X,X')| \,\big|\, X = x\big).$$

Suppose $C$ and $\epsilon$ are constants such that $|v_1(X)| \le C$ and
$v_2(X) \le \epsilon$ a.s. Then
$\mathbb{P}\{|f(X)| \ge t\} \le 2e^{-Ct^2/(2C^2+8\epsilon t)}$ for any $t \ge 0$.

Remark. Note the difference between $v$ and $v_1$: the latter has no absolute value
inside. This can be considered as a "second order refinement" of the original inequality
in Theorem 3.3.

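Proof. Recall the identity

$$m'(\theta) = \frac{1}{2}\,\mathbb{E}\big((e^{\theta f(X)} - e^{\theta f(X')})F(X,X')\big)$$

that was derived using Lemma 3.1 in the proof of Theorem 3.3. Instead of simply
using the inequality $|e^x - e^y| \le \frac{1}{2}(e^x + e^y)|x - y|$, we now take a finer
recourse: For any $x < y$, note that

$$0 \le \frac{1}{2}(e^x + e^y)(y - x) - (e^y - e^x)
  = \int_x^y \Big(\frac{e^y - e^x}{y - x}(u - x) + e^x - e^u\Big)\,du$$
$$\le \int_x^y \frac{e^y - e^x}{y - x}(u - x)\,du
  = \frac{1}{2}(e^y - e^x)(y - x) \le \frac{1}{4}(e^x + e^y)(y - x)^2.$$

Thus, we have

$$e^x - e^y = \frac{1}{2}(e^x + e^y)(x - y) + b(x,y),$$

where

$$|b(x,y)| \le \frac{1}{4}(e^x + e^y)(x - y)^2.$$

Combining this with the observation that $(x - y)F(x,y) = (y - x)F(y,x)$, it is not
difficult to deduce that for any $\theta \ge 0$ we have

$$m'(\theta) \le (C\theta + \epsilon\theta^2)\,m(\theta).$$

This gives, for any $\theta \ge 0$,

$$\log m(\theta) \le \frac{C\theta^2}{2} + \frac{\epsilon\theta^3}{3}.$$

Now fix any $t > 0$. Let $\theta$ be the positive solution of the equation
$\epsilon\theta^2 + C\theta = t$. Explicitly,

$$\theta = \frac{\sqrt{C^2 + 4\epsilon t} - C}{2\epsilon}.$$

Then

$$-\theta t + \log m(\theta) \le -\theta(\epsilon\theta^2 + C\theta) + \frac{C\theta^2}{2} + \frac{\epsilon\theta^3}{3}
  = -\frac{C\theta^2}{2} - \frac{2\epsilon\theta^3}{3} \le -\frac{\theta t}{2}.$$

Now note that for any $a > 0$,

$$\sqrt{1+a} - 1 = \frac{a}{\sqrt{1+a} + 1} \ge \frac{a}{2\sqrt{1+a}} \ge \frac{a}{2(1+a)}.$$

Thus, taking $a = 4\epsilon t/C^2$,

$$\theta = \frac{C}{2\epsilon}\big(\sqrt{1+a} - 1\big) \ge \frac{Ct}{C^2 + 4\epsilon t}.$$

Combining the steps, we get

$$\mathbb{P}\{f(X) \ge t\} \le e^{-\theta t + \log m(\theta)} \le e^{-\theta t/2} \le e^{-Ct^2/(2C^2+8\epsilon t)}.$$

The lower tail bound follows by symmetry. □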
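
The optimization over $\theta$ in the proof can be traced numerically. The sketch
below, with hypothetical function names, computes the positive root of
$\epsilon\theta^2 + C\theta = t$ and the three exponents
$-\theta t + C\theta^2/2 + \epsilon\theta^3/3$, $-\theta t/2$ and
$-Ct^2/(2C^2 + 8\epsilon t)$, which should come out in increasing order. It assumes
$C > 0$ and $\epsilon > 0$.

```python
import math

def optimal_theta(C, eps, t):
    """Positive root of eps*theta^2 + C*theta = t (requires eps > 0)."""
    return (math.sqrt(C * C + 4.0 * eps * t) - C) / (2.0 * eps)

def exponent_chain(C, eps, t):
    """Return the three successive exponents used in the proof."""
    th = optimal_theta(C, eps, t)
    first = -th * t + C * th ** 2 / 2.0 + eps * th ** 3 / 3.0   # -theta*t + bound on log m
    second = -th * t / 2.0
    third = -C * t * t / (2.0 * C * C + 8.0 * eps * t)
    return first, second, third

def tail_bound(C, eps, t):
    """Two-sided bound 2*exp(-C t^2 / (2 C^2 + 8 eps t))."""
    return 2.0 * math.exp(-C * t * t / (2.0 * C * C + 8.0 * eps * t))

print(exponent_chain(1.0, 0.5, 2.0))
print(tail_bound(1.0, 0.5, 2.0))
```

For small $t$ the bound behaves like $2e^{-t^2/2C}$, and for large $t$ like
$2e^{-Ct/8\epsilon}$, which is the usual gaussian-to-exponential transition.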



3.10 Application to spin glasses 

The Sherrington-Kirkpatrick (S-K) model of spin glasses considers the Hamiltonian 

n 

l<i<j<n i=l 



where the $g_{ij}$'s are a fixed realization of a collection of i.i.d. standard gaussian
random variables. For recent advances in the rigorous analysis of this model, we refer to
Chapter 2 of the book by Michel Talagrand, and also his proof [112] of the
Parisi formula for the limiting free energy in this model. For the physical perspective,
one should look at the book [T^ by Mezard, Parisi and Virasoro. 

In spite of the fact that the Parisi formula has been proved, very little is explicitly
known about the low temperature phase of the S-K model at the time of writing this
thesis. In fact, as Talagrand admits ([111], p. 182), "we do not even know where to
start"!

The mean field equations discussed in section 3.4 are no longer valid in the S-K
model, even at high temperatures. Instead, physicists think that a modification of
the naive mean field equations (3.6), the so-called TAP equations, hold for the S-K
model. These are as follows:

"$\langle\sigma_i\rangle = \tanh\Big[\frac{\beta}{\sqrt{n}}\sum_{j\ne i} g_{ij}\langle\sigma_j\rangle + h - \beta^2(1-q)\langle\sigma_i\rangle\Big], \quad i = 1,\ldots,n$"    (3.20)

where $q$ solves $q = \mathbb{E}\tanh^2(\beta Z\sqrt{q} + h)$, $Z$ being a standard
gaussian random variable. The validity of this conjecture has been established
rigorously for sufficiently small $\beta$ by Talagrand ([111], Theorem 2.4.20).

In this section, we shall prove that another set of mean field type equations are
valid, irrespective of the temperature, in a class of models which will encompass the
S-K model. For instance, we shall be able to show that in this class of models, the
magnetization $m(\sigma) = \frac{1}{n}\sum_{i=1}^n \sigma_i$ satisfies

$$m(\sigma) \approx \frac{1}{n}\sum_{i=1}^n \tanh(\beta m_i(\sigma)) \quad \text{with high probability,}    \qquad (3.21)$$

where $m_i(\sigma) := n^{-1/2}\sum_{j\le n,\, j\ne i} g_{ij}\sigma_j + h$ is the "local
field" at site $i$. It will follow from the same argument that whenever
$k \ll \sqrt{n}$, we also have

$$\frac{1}{n}\sum_{i=1}^n \prod_{r=1}^k \sigma_i^r \approx \frac{1}{n}\sum_{i=1}^n \prod_{r=1}^k \tanh(\beta m_i(\sigma^r)),    \qquad (3.22)$$

where $\sigma^1, \ldots, \sigma^k$ are i.i.d. from the Gibbs measure. The left hand side,
called the "overlap" of the $k$ configurations, is a particularly interesting quantity
in the S-K model when $k = 2$.

However, we must admit that we have found no use for all this information till 
now. Still, it may be interesting just because the low temperature phase of the S-K 
model is a highly intractable object. 

Our results will be valid whenever the operator norms of the matrices $J$ and
$J_2 := (J_{ij}^2)$ are bounded. Recall that the operator norm of a matrix $A$ is defined
as

$$\|A\| := \max_{\|x\|=1}\|Ax\|.$$

Alternatively, it is the square root of the maximum eigenvalue of $A^TA$. In the S-K
model, it is well-known that the norm of $J$ is bounded in probability. In fact, it
is known that with $J^{(n)} = (n^{-1/2}g_{ij})_{1\le i,j\le n}$, where
$\{g_{ij}\}_{1\le i\le j\le n}$ are i.i.d. standard gaussian random variables and
$g_{ji} = g_{ij}$, we have

$$\|J^{(n)}\| \to 2 \quad \text{in probability.}$$

A proof of this result, as well as tail bounds, can be found in the survey article by Bai
[8], section 2.2. It is also easy to see that the norm of $J_2$ is bounded in probability in
the S-K model, because

$$\|J_2\| \le \max_{1\le i\le n}\sum_{j=1}^n J_{ij}^2 = \max_{1\le i\le n}\frac{1}{n}\sum_{j=1}^n g_{ij}^2 \to \mathbb{E}(g_{11}^2) \quad \text{in probability,}$$

since for each $i$, $\frac{1}{n}\sum_{j=1}^n g_{ij}^2$ is concentrated around
$\mathbb{E}(g_{11}^2)$ and the tails fall off sharply enough for the above to hold (a fact
that can be easily proved using the Burkholder-Davis-Gundy inequality). The key result
of this section is the following:

Theorem 3.16 Let $J$ be the interaction matrix in a model described by the Hamiltonian
above. Let $J_2 = (J_{ij}^2)_{1\le i,j\le n}$. Take any
$a = (a_1, \ldots, a_n) \in \mathbb{R}^n$. Fix $\beta \ge 0$, and let

$$C = C(a,\beta) := 2\big(1 + \beta\|J\| + \beta^2\|J_2\|\big)\|a\|^2, \quad\text{and}$$
$$\epsilon = \epsilon(a,\beta) := 4\big(\max_i |a_i|\big)\big[1 + (\beta\|J\| + \beta^2\|J_2\|)^2\big]\|a\|^2.$$

For each $i$, let $m_i = m_i(\sigma) := \sum_{j=1}^n J_{ij}\sigma_j + h$, and let

$$Y = \sum_{i=1}^n a_i\big(\sigma_i - \tanh(\beta m_i)\big).$$

If $\sigma$ is drawn from the Gibbs measure at inverse temperature $\beta$, then
$\mathbb{E}(Y) = 0$, $\mathrm{Var}(Y) \le C$, and for any $t \ge 0$,
$\mathbb{P}\{|Y| \ge t\} \le 2e^{-Ct^2/(2C^2+8\epsilon t)}$.

Applications. Taking $a_i = 1/n$ for each $i$, we get $\|a\|^2 = 1/n$. Fix $\beta \ge 0$,
and let $K = \beta\|J\| + \beta^2\|J_2\|$. As observed before, $K = O(1)$ in the S-K model.
We have $C = (2 + 2K)/n$ and $\epsilon = 4(1 + K^2)/n^2$. Thus, if we let
$m(\sigma) := \frac{1}{n}\sum_{i=1}^n \sigma_i$ be the magnetization of a configuration
drawn from the Gibbs measure at inverse temperature $\beta$, and let

$$Y = m(\sigma) - \frac{1}{n}\sum_{i=1}^n \tanh(\beta m_i(\sigma)),$$

where, as usual, $m_i(\sigma) = \sum_{j=1}^n J_{ij}\sigma_j + h$ is the local field at $i$,
then $\mathbb{E}(Y^2) \le (2 + 2K)/n$ and

$$\mathbb{P}\{|Y| \ge t\} \le 2e^{-nt^2/(a+bt)},    \qquad (3.23)$$

where $a = 4 + 4K$ and $b = 16(1 + K^2)/(1 + K)$ are free of $n$. This shows (3.21).

The relation (3.22) for the $k^{\mathrm{th}}$-order overlaps can be treated exactly in the
same way, by successively conditioning on
$(\sigma^1, \ldots, \sigma^{r-1}, \sigma^{r+1}, \ldots, \sigma^k)$, $r = 1, \ldots, k$, and
getting

$$\frac{1}{n}\sum_{i=1}^n \prod_{s=1}^{r-1}\tanh(\beta m_i(\sigma^s))\prod_{s=r}^{k}\sigma_i^s \approx \frac{1}{n}\sum_{i=1}^n \prod_{s=1}^{r}\tanh(\beta m_i(\sigma^s))\prod_{s=r+1}^{k}\sigma_i^s$$

at the $r^{\mathrm{th}}$ stage. Roughly, this can be carried out till $k = o(\sqrt{n})$,
since the errors accrued at each stage are like $n^{-1/2}$.
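
The first two conclusions of Theorem 3.16 can be verified by brute force on a tiny
system. The following is only a sketch under our own assumptions: we take the Gibbs
measure to be proportional to
$\exp\big(\beta(\sum_{i<j}J_{ij}\sigma_i\sigma_j + h\sum_i\sigma_i)\big)$ with a
symmetric $J$ having zero diagonal (which makes
$\mathbb{E}(\sigma_i \mid \sigma_j, j \ne i) = \tanh(\beta m_i)$), and we replace the
operator norms in $C$ by maximum absolute row sums, which can only enlarge $C$. The
matrix `J_example` and all function names are hypothetical.

```python
import itertools
import math

def gibbs_check(J, h, beta, a):
    """Enumerate {-1,1}^n and return (E[Y], Var[Y]) for
    Y = sum_i a_i * (sigma_i - tanh(beta * m_i(sigma)))."""
    n = len(J)
    weights, values = [], []
    for sigma in itertools.product((-1, 1), repeat=n):
        pair_energy = sum(J[i][j] * sigma[i] * sigma[j]
                          for i in range(n) for j in range(i + 1, n))
        weights.append(math.exp(beta * (pair_energy + h * sum(sigma))))
        local = [sum(J[i][j] * sigma[j] for j in range(n)) + h for i in range(n)]
        values.append(sum(a[i] * (sigma[i] - math.tanh(beta * local[i]))
                          for i in range(n)))
    Z = sum(weights)
    mean = sum(w * y for w, y in zip(weights, values)) / Z
    var = sum(w * (y - mean) ** 2 for w, y in zip(weights, values)) / Z
    return mean, var

def variance_constant_upper(J, beta, a):
    """Upper bound on C = 2(1 + beta||J|| + beta^2||J_2||)||a||^2, using maximum
    absolute row sums, which dominate the operator norm of a symmetric matrix."""
    row_norm = max(sum(abs(v) for v in row) for row in J)
    row_norm2 = max(sum(v * v for v in row) for row in J)
    return 2.0 * (1.0 + beta * row_norm + beta ** 2 * row_norm2) * sum(x * x for x in a)

# Hypothetical 4-spin interaction matrix (symmetric, zero diagonal).
J_example = [[0.0, 0.5, -0.3, 0.8],
             [0.5, 0.0, 0.2, -0.6],
             [-0.3, 0.2, 0.0, 0.4],
             [0.8, -0.6, 0.4, 0.0]]
a_example = [0.25] * 4
print(gibbs_check(J_example, 0.3, 1.5, a_example))
print(variance_constant_upper(J_example, 1.5, a_example))
```

The mean should vanish up to rounding error at every temperature, which is the
"irrespective of the temperature" feature emphasized above.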



Proof of Theorem 3.16. As in the proof of Proposition 3.4, we construct an
exchangeable pair by taking a step in the Gibbs sampler chain as follows: First, draw
$\sigma$ from the Gibbs measure. Next, choose a coordinate $I$ uniformly at random, and
replace the $I^{\mathrm{th}}$ coordinate of $\sigma$ by a sample drawn from the conditional
distribution of $\sigma_I$ given $\{\sigma_j, j \ne I\}$. Call the resulting
configuration $\sigma'$. Let

$$F(\sigma, \sigma') := n\sum_{i=1}^n a_i(\sigma_i - \sigma_i') = n\,a_I(\sigma_I - \sigma_I').$$

Then $F$ is antisymmetric and

$$f(\sigma) = \mathbb{E}(F(\sigma,\sigma') \mid \sigma) = \sum_{i=1}^n a_i\big(\sigma_i - \mathbb{E}(\sigma_i \mid \sigma_j, j \ne i)\big)
  = \sum_{i=1}^n a_i\big(\sigma_i - \tanh(\beta m_i(\sigma))\big).$$

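For any $\sigma \in \{-1,1\}^n$ and $1 \le j \le n$, let

$$\sigma^{(j)} := (\sigma_1, \ldots, \sigma_{j-1}, -\sigma_j, \sigma_{j+1}, \ldots, \sigma_n).$$

Define $h : \mathbb{R} \to \mathbb{R}$ as $h(x) := \tanh(\beta x)$, and for each $i,j$, let

$$b_{ij} = b_{ij}(\sigma) := h(m_i(\sigma)) - h(m_i(\sigma^{(j)})),$$

where $m_i(\sigma) = \sum_{j=1}^n J_{ij}\sigma_j + h$ is defined in the statement of the
theorem. Now note that

$$f(\sigma) - f(\sigma^{(j)}) = 2a_j\sigma_j - \sum_{i=1}^n a_ib_{ij}, \quad\text{and}\quad F(\sigma,\sigma^{(j)}) = 2na_j\sigma_j.    \qquad (3.24)$$

Let $p_j = p_j(\sigma) := \mathbb{P}\{\sigma_j' = -\sigma_j \mid \sigma, I = j\}$. Combining
everything, we have

$$v_1(\sigma) = \frac{1}{2}\,\mathbb{E}\big((f(\sigma) - f(\sigma'))F(\sigma,\sigma') \mid \sigma\big) = 2\sum_{j=1}^n a_j^2p_j - \sum_{i,j} a_ib_{ij}a_j\sigma_jp_j.    \qquad (3.25)$$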
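
It is easy to verify that $\|h''\|_\infty \le \beta^2$. Therefore,

$$\big|b_{ij} - (m_i(\sigma) - m_i(\sigma^{(j)}))\,h'(m_i(\sigma))\big| \le \frac{\beta^2}{2}\big(m_i(\sigma) - m_i(\sigma^{(j)})\big)^2.$$

Now let $c_i = c_i(\sigma) := h'(m_i(\sigma))$, and note that
$m_i(\sigma) - m_i(\sigma^{(j)}) = 2J_{ij}\sigma_j$. Combining, we have

$$|b_{ij} - 2J_{ij}\sigma_jc_i| \le 2\beta^2J_{ij}^2.$$

Finally, note that $|c_i| \le \beta$. Using all this information, we get, for any
$x, y \in \mathbb{R}^n$,

$$\Big|\sum_{i,j} x_iy_jb_{ij}\Big| \le \Big|\sum_{i,j} 2(x_ic_i)J_{ij}(y_j\sigma_j)\Big| + \sum_{i,j} 2\beta^2|x_iy_j|J_{ij}^2    \qquad (3.26)$$
$$\le 2\|J\|\Big(\sum_{i=1}^n (x_ic_i)^2\Big)^{1/2}\Big(\sum_{i=1}^n (y_i\sigma_i)^2\Big)^{1/2} + 2\beta^2\sum_{i,j}|x_iy_j|J_{ij}^2$$
$$\le \big(2\beta\|J\| + 2\beta^2\|J_2\|\big)\|x\|\,\|y\|.    \qquad (3.27)$$

Let $K = 2\beta\|J\| + 2\beta^2\|J_2\|$. Using the above inequality in equation (3.25)
with $x_i = a_i$ and $y_j = a_j\sigma_jp_j$, we get

$$|v_1(\sigma)| \le (2 + K)\sum_{i=1}^n a_i^2.$$
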

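The first assertion of this theorem now follows by Theorem 3.2. Next, we use (3.24)
once again (and the inequality $(a + b)^2 \le 2a^2 + 2b^2$) to get

$$v_2(\sigma) = \frac{1}{4}\,\mathbb{E}\big((f(\sigma) - f(\sigma'))^2\,|F(\sigma,\sigma')| \mid \sigma\big)
  = \frac{1}{2}\sum_{j=1}^n |a_j|\,p_j\Big(2a_j\sigma_j - \sum_{i=1}^n a_ib_{ij}\Big)^2$$
$$\le \big(\max_i |a_i|\big)\Big[4\sum_{i=1}^n a_i^2 + \sum_{j=1}^n\Big(\sum_{i=1}^n a_ib_{ij}\Big)^2\Big].    \qquad (3.28)$$

Now, put $x_i = a_i$ and $y_j = \sum_{k=1}^n a_kb_{kj}$, $i, j = 1, \ldots, n$. Then

$$\|y\|^2 = \sum_{j=1}^n y_j\sum_{i=1}^n a_ib_{ij} = \sum_{i,j} x_iy_jb_{ij} \le K\|x\|\,\|y\| \quad \text{by inequality (3.27).}$$

Thus, $\|y\| \le K\|x\|$, and therefore by (3.28) we have

$$v_2(\sigma) \le \big(\max_i |a_i|\big)(4 + K^2)\sum_{i=1}^n a_i^2 = \epsilon(a, \beta).$$

The proof is now completed by applying Theorem 3.15. □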



Chapter 4 

Theory and examples: Part II 



We begin this chapter with Stein's observation [102] that an exchangeable pair $(X, X')$
automatically defines a reversible Markov kernel $P$ as

$$Pf(x) := \mathbb{E}(f(X') \mid X = x),    \qquad (4.1)$$

where $f$ is any function such that $\mathbb{E}|f(X)| < \infty$. All other notation will
be the same as what was defined at the beginning of Chapter 3.

The main purpose of the subsequent discussion is to connect the concentration
properties of the distribution of $X$ with the rate of convergence to stationarity of
a Markov chain following the kernel $P$. In particular, information about the rate of
decay of $P^kf(x) - P^kf(y)$, where $(x, y)$ is any point in the support of $(X, X')$,
can give us a bound on $F(X, X')$ in a way that we are going to describe in the
following pages. This is a direction that has not been systematically explored in the
concentration literature, to the best of our knowledge. We shall also describe a
coupling technique for getting a handle on the antisymmetric function $F$ when $f$ is
known.

The techniques so developed will be applied in section 4.2 to get concentration
under a famous "weak dependence" condition from statistical physics. Examples
from classical spin systems at high temperature and random proper graph colorings
are worked out in the subsequent sections.

In a later section we shall use the results of this section in another direction to
obtain concentration of Haar measures on compact groups based on convergence rates of
random walks. The result will then be used to derive a quantitative version
of Voiculescu's connection [114] between random matrices and free probability theory.



4.1 Explicit construction of $F(X, X')$

Our construction of $F$ will involve the Poisson equation associated with the kernel $P$,
which can be written as follows:

$$g - Pg = f.    \qquad (4.2)$$

Here $f$ is a given function, and the objective is to solve for $g$. The Poisson equation is
an object of deep mathematical significance. Its importance in the theory of Markov
chains was realized after the work of Paul Meyer [?] and the subsequent investigation
of Neveu [?]. The contributions by Nummelin [70] are also significant. A classical
textbook reference is the book by Revuz. Poisson's equation has been used in
the probability literature by too many authors to mention in this short space; for
a recent survey of the literature about Poisson's equation in the context of discrete
Markov chains, we refer to Makowski and Shwartz.

The way we can think of using the solution to Poisson's equation in our problem
is the following: If we let $F(x,y) = g(x) - g(y)$, where $g$ is a solution to (4.2), then
$F$ is antisymmetric and

$$\mathbb{E}(F(X,X') \mid X) = g(X) - \mathbb{E}(g(X') \mid X) = g(X) - Pg(X) = f(X).$$

Generally, the solution to Poisson's equation is given by

$$g = \sum_{k=0}^{\infty} P^kf,$$

but problems with convergence are not uncommon. The solution need not be unique,
either. Various conditions have been proposed for the existence and uniqueness of
solutions of Poisson's equation in various situations. For a summary of such results,
we refer to [64].

We shall not need the weakest conditions for the existence of solutions to Poisson's
equation in any of our applications. The following lemma formalizes our construction
of $F$ under generous assumptions, which are satisfied for geometrically ergodic Markov
chains, and all of our examples belong to that class.

Lemma 4.1 Let $f : \mathcal{X} \to \mathbb{R}$ be a measurable function such that
$\mathbb{E}f(X) = 0$. Suppose there is a finite constant $L$ such that

$$\sum_{k=0}^{\infty} |P^kf(x) - P^kf(y)| \le L \quad \text{for every } x \text{ and } y.    \qquad (4.3)$$

Then the function

$$F(x,y) := \sum_{k=0}^{\infty} \big(P^kf(x) - P^kf(y)\big)$$

satisfies $F(X,X') = -F(X',X)$ and $\mathbb{E}(F(X,X') \mid X) = f(X)$.

Proof. The convergence of the series defining $F$ follows from the summability
assumption (4.3). Also, $F$ is clearly antisymmetric. Now,

$$\mathbb{E}(P^kf(X) - P^kf(X') \mid X) = P^kf(X) - P^{k+1}f(X).$$

Thus, for any $N$, we have

$$\sum_{k=0}^{N} \mathbb{E}(P^kf(X) - P^kf(X') \mid X) = f(X) - P^{N+1}f(X).$$

The condition (4.3) ensures that the above partial sums converge everywhere.
Consequently, the sequence $\{P^{N+1}f(X)\}_{N\ge 0}$ also converges everywhere. Again,
condition (4.3) implies that the limit is a constant, since for any $x$ and $y$,
$P^kf(x) - P^kf(y) \to 0$ as $k \to \infty$. Since
$\mathbb{E}(P^{N+1}f(X)) = \mathbb{E}f(X) = 0$, this constant can only be zero. □
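
On a finite state space the construction of Lemma 4.1 can be carried out explicitly.
The sketch below is our own toy illustration, not one of the examples of this thesis:
it uses a lazy random walk on a three-point path (a reversible kernel), a function $f$
with mean zero under the stationary law, and a truncated version of the series
$g = \sum_k P^kf$. All names are hypothetical.

```python
def mat_vec(P, v):
    """Apply the kernel: (Pv)(x) = sum_y P[x][y] * v[y]."""
    return [sum(P[x][y] * v[y] for y in range(len(v))) for x in range(len(v))]

def poisson_solution(P, f, terms=2000):
    """Truncated series g = sum_{k < terms} P^k f."""
    g = [0.0] * len(f)
    term = list(f)
    for _ in range(terms):
        g = [gi + ti for gi, ti in zip(g, term)]
        term = mat_vec(P, term)
    return g

def build_F(P, f, terms=2000):
    """F(x, y) = g(x) - g(y), the antisymmetric function of Lemma 4.1."""
    g = poisson_solution(P, f, terms)
    n = len(f)
    return [[g[x] - g[y] for y in range(n)] for x in range(n)]

# Toy example: lazy random walk on the path 0 - 1 - 2.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]           # stationary (and reversing) distribution
f = [1.0, 0.0, -1.0]             # mean zero under pi
F = build_F(P, f)

# E(F(X, X') | X = x) is sum_y P[x][y] * F[x][y]; it should reproduce f(x).
recovered = [sum(P[x][y] * F[x][y] for y in range(3)) for x in range(3)]
print(recovered)
```

In this example $f$ is an eigenfunction of $P$ with eigenvalue $1/2$, so the series is
geometric and $g = 2f$.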

Although Lemma 4.1 gives an explicit expression for $F$, it is somewhat inconvenient
for practical purposes. We shall now give a coupling version of Lemma 4.1 that
will be easier to work with in practice.

Let $\{X_k\}_{k\ge 0}$ and $\{X_k'\}_{k\ge 0}$ be two chains from the kernel defined by
$(X,X')$, with arbitrary initial values, and coupled according to some coupling scheme
which satisfies the following property:

P: For every initial value $(x,y)$, and every $k$, the marginal distribution of $X_k$
depends only on $x$ and the marginal distribution of $X_k'$ depends only on $y$.

Under this assumption, we have the following lemma:

Lemma 4.2 Suppose the chains $\{X_k\}$ and $\{X_k'\}$ satisfy the property P described
above. Let $f : \mathcal{X} \to \mathbb{R}$ be a function such that $\mathbb{E}f(X) = 0$.
Suppose there exists a finite constant $L$ such that for every $(x,y) \in \mathcal{X}^2$,

$$\sum_{k=0}^{\infty} \big|\mathbb{E}(f(X_k) - f(X_k') \mid X_0 = x, X_0' = y)\big| \le L.    \qquad (4.4)$$

Then, the function $F$ defined as

$$F(x,y) := \sum_{k=0}^{\infty} \mathbb{E}(f(X_k) - f(X_k') \mid X_0 = x, X_0' = y)$$

satisfies $F(X,X') = -F(X',X)$ and $\mathbb{E}(F(X,X') \mid X) = f(X)$.

Remark. In practice, we will start our chains with $X_0 = X$ and $X_0' = X'$ for directly
obtaining bounds on $F(X,X')$ in the process of verifying (4.4).

Proof. Property P implies that

$$\mathbb{E}(f(X_k) \mid X_0 = x, X_0' = y) = \mathbb{E}(f(X_k) \mid X_0 = x) = P^kf(x),$$

where $P$ is the kernel defined in (4.1). Similarly,
$\mathbb{E}(f(X_k') \mid X_0 = x, X_0' = y) = P^kf(y)$.
The rest is a rewriting of the conclusions of Lemma 4.1. □

Example. Let $\mathcal{X} = \mathbb{R}^n$, and let $X = (X_{(1)}, \ldots, X_{(n)})$ be a
vector with independent components. (The subscripts are put in brackets because we will
be using $\{X_k\}_{k\ge 0}$ to denote a Markov chain on $\mathbb{R}^n$.) As a preliminary
example, and also as a prelude to section 4.2, we shall now work out the concentration of
$f(X)$, where $f$ is a function satisfying

$$|f(x) - f(y)| \le \sum_{i=1}^n c_i\,\mathbb{I}\{x_i \ne y_i\}.$$

In other words, if $x$ and $y$ differ only at coordinate $i$, then
$|f(x) - f(y)| \le c_i$. Our technique will give an easy way to recover a version of the
bounded difference inequality (Theorem 2.3) without using martingale or information
theoretic results. We shall only give a brief description of the steps in our solution,
because the details will be worked out under a more general setting in the next section.

We produce $X'$ by choosing a coordinate $I$ uniformly at random, and replacing
the $I^{\mathrm{th}}$ coordinate of $X$ by $X^*_{(I)}$, where the vector
$(X^*_{(1)}, \ldots, X^*_{(n)})$ has the same distribution as $X$ and is independent of
$X$. It is easy to see that $(X,X')$ is an exchangeable pair.

The coupling is done in the natural way: At every step, choose the same coordinate
$I$ for both chains, and replace the $I^{\mathrm{th}}$ coordinate of both by the same
realization of $X^*_{(I)}$. Since the number of coordinates at which $X_k$ and $X_k'$
differ cannot increase with $k$, it is clear (by the coupon collector's problem) that the
coupling time for the chains has expected value of order at most $n\log n$. This proves
condition (4.4).

Now suppose we start with $(X_0, X_0') = (x, y)$. If $x$ and $y$ differ only at coordinate
$i$, then $|f(X_k) - f(X_k')| \le c_i$ for every $k$, and the expected coupling time is
$n$. Thus, by the representation of $F$ in Lemma 4.2, we get $|F(x,y)| \le nc_i$.
Consequently, we have

$$v(X) = \frac{1}{2}\,\mathbb{E}\big(|(f(X) - f(X'))F(X,X')| \,\big|\, X\big) \le \frac{1}{2n}\sum_{i=1}^n c_i \cdot nc_i = \frac{1}{2}\sum_{i=1}^n c_i^2.$$

Theorem 3.3 now gives us a version of the well-known bounded difference inequality
(Theorem 2.3), albeit with a missing 2 in the exponent.
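
To see the effect of the missing 2, one can compare the two bounds with an exact tail
in the simplest case. The sketch below rests on two assumptions of ours: that
Theorem 3.3 has the form $\mathbb{P}\{|f(X)| \ge t\} \le 2e^{-t^2/2C}$ whenever
$v(X) \le C$ (which is what the remark about the exponent suggests), and that the
example is $f(x) = \sum_i x_i - n/2$ with i.i.d. fair bits $x_i \in \{0,1\}$, so that
$c_i = 1$. The function names are hypothetical.

```python
from math import comb, exp

def exact_tail(n, t):
    """P{|S - n/2| >= t} for S ~ Binomial(n, 1/2), computed exactly."""
    hits = sum(comb(n, s) for s in range(n + 1) if abs(s - n / 2) >= t)
    return hits / 2 ** n

def exchangeable_pair_bound(c, t):
    """2*exp(-t^2 / sum c_i^2), i.e. 2*exp(-t^2/(2C)) with C = (1/2) sum c_i^2."""
    return 2.0 * exp(-t * t / sum(ci * ci for ci in c))

def bounded_difference_bound(c, t):
    """Classical bounded difference inequality 2*exp(-2 t^2 / sum c_i^2)."""
    return 2.0 * exp(-2.0 * t * t / sum(ci * ci for ci in c))

n = 20
c = [1.0] * n
for t in (2, 4, 6, 8):
    print(t, exact_tail(n, t), bounded_difference_bound(c, t), exchangeable_pair_bound(c, t))
```

Each row should be increasing from left to right: the exact tail, then the classical
bound, then the slightly weaker bound obtained from the exchangeable pair.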

4.2 Concentration under weak dependence 

Let $\mathcal{X} = \Omega^n$, where $\Omega$ is a Polish space. Consequently, our random
variable $X$ will now have $n$ coordinates, which we shall assume to be weakly dependent
in the sense of Dobrushin, a familiar notion from statistical mechanics which we shall
define later in this section.

We shall also assume that our function $f$ satisfies a Lipschitz condition with
respect to a generalized Hamming distance on $\mathcal{X}$:

$$|f(x) - f(y)| \le \sum_{i=1}^n c_i\,\mathbb{I}\{x_i \ne y_i\}    \qquad (4.5)$$

for some fixed constants $c_1, \ldots, c_n$. As mentioned before, this just means that the
value of the function does not change by more than $c_i$ if the $i^{\mathrm{th}}$
coordinate is altered.

The generalization of the Hamming metric is useful because it allows us to get
concentration for lower dimensional marginals. For instance, if we want the
concentration of a function of $(X_1, \ldots, X_k)$, where $k < n$, we can just consider
every function of $(X_1, \ldots, X_k)$ as a function of $(X_1, \ldots, X_n)$ with
$c_{k+1} = \cdots = c_n = 0$.

In the following, we shall be using the notation $\bar{x}^i$ to denote the element of
$\Omega^{n-1}$ obtained by omitting the $i^{\mathrm{th}}$ coordinate of the vector
$x \in \Omega^n$.

We shall let $\mu$ denote the law of $X$, and for each $i$ and $x$,
$\mu_i(\cdot \mid \bar{x}^i)$ will stand for the law of the $i^{\mathrm{th}}$ coordinate of
$X$ given that $\bar{X}^i = \bar{x}^i$.

At this point, let us also mention that we shall usually denote the $i^{\mathrm{th}}$
coordinate of a vector $x$ by $x_i$. Unfortunately, we are also using the notation $X_k$
for the $k^{\mathrm{th}}$ element of a Markov chain. We hope that the context will clarify
any ambiguity. In particular, to denote the $i^{\mathrm{th}}$ coordinate of $X_k$ we shall
use the notation $X_{k,i}$.

Finally, let us recall that for a square matrix $A$, the $L^2$ operator norm of $A$ is
defined as:

$$\|A\|_2 := \max_{\|y\|=1}\|Ay\|.$$

While we are at it, let us also recall that the $L^\infty$ and $L^1$ operator norms can be
expressed as

$$\|A\|_\infty := \max_{1\le i\le n}\sum_{j=1}^n |a_{ij}|, \quad\text{and}\quad \|A\|_1 := \max_{1\le j\le n}\sum_{i=1}^n |a_{ij}|.$$

It is a frequently useful fact that $\|A\|_2 \le \sqrt{\|A\|_1\|A\|_\infty}$.

The following theorem gives a simple way of obtaining concentration bounds for
$f(X)$ under a contractivity condition on the conditional laws:

Theorem 4.3 Suppose $A = (a_{ij})$ is an $n \times n$ matrix with nonnegative entries and
zeros on the diagonal such that for any $i$, and any $x, y \in \Omega^n$,

$$d_{TV}\big(\mu_i(\cdot \mid \bar{x}^i), \mu_i(\cdot \mid \bar{y}^i)\big) \le \sum_{j=1}^n a_{ij}\,\mathbb{I}\{x_j \ne y_j\},$$

where $d_{TV}$ is the total variation distance on the space of probability measures on
$\Omega$. Suppose $f$ satisfies the generalized Lipschitz condition (4.5). If
$\|A\|_2 < 1$, we have

$$\mathbb{P}\{|f(X) - \mathbb{E}f(X)| \ge t\} \le 2\exp\Big(-\frac{(1 - \|A\|_2)\,t^2}{\sum_{i=1}^n c_i^2}\Big)$$

for each $t \ge 0$.



Remarks. The matrix A is called "Dobrushin's interdependence matrix" and the 
condition ||A||oo < 1 is called Dobrushin's condition (as introduced by Dobrushin 
P3] and extended by Dobrushin & Shlosman EZ])- Dobrushin's condition en- 
sures, among other things, the uniqueness of Gibbs states at high temperature. In 
a series of important papers, Stroock and Zegarlinski [103, 104, 105] showed that
for spin systems on a lattice, the Dobrushin-Shlosman conditions are equivalent to
the validity of logarithmic Sobolev inequalities for the associated Glauber dynamics.
Though concentration inequalities follow from log-Sobolev inequalities, explicit
constants are not available from this body of work. Our approach gives a direct way of
getting explicit concentration bounds from Dobrushin type conditions. For more on
Dobrushin's condition and its consequences, see Georgii's book, Chapter 8. For
a recent treatise on logarithmic Sobolev inequalities in the context of spin systems
and Dobrushin-Shlosman type conditions, see Ledoux.

As mentioned in section Y2A\ Dobrushin type conditions were recently used by 
Marton [HZl IHH] to obtain transportation cost inequalities (and hence concentration) 
under weak dependence. However, Marton's results do not seem to be suited for the 
setting that we are now working under. Besides, they require us to know some bounds 
on certain log-Sobolev constants, which are hard to obtain. 

Proof of Theorem 4.3. To prove this theorem, we shall construct a reversible
Markov kernel and a suitable coupling, and then directly apply Lemma 4.2 in
conjunction with the tools from Chapter 3.

First, define a reversible Markov kernel as follows: At each step, choose a
coordinate $I$ uniformly from $\{1, \ldots, n\}$. Then replace the $I^{\mathrm{th}}$
coordinate of the current state of the chain by an element of $\Omega$ chosen according to
the conditional distribution of the $I^{\mathrm{th}}$ coordinate given the values of the
other coordinates. This is the usual Gibbs sampling technique, and it is well-known and
easy to prove that the chain is reversible. This is also known as the "Glauber dynamics"
in the case of spin systems.

Now, we describe the coupling. Suppose at any stage, the $X$-chain is at $x$, and
the $X'$-chain is at $y$. Choose a coordinate $I$ uniformly at random. By the well-known
property of the total variation distance, we can have two $\Omega$-valued random variables
$W_1$ and $W_2$ defined on some probability space such that
$W_1 \sim \mu_I(\cdot \mid \bar{x}^I)$, $W_2 \sim \mu_I(\cdot \mid \bar{y}^I)$, and

$$\mathbb{P}\{W_1 \ne W_2\} = d_{TV}\big(\mu_I(\cdot \mid \bar{x}^I), \mu_I(\cdot \mid \bar{y}^I)\big) \le \sum_{j=1}^n a_{Ij}\,\mathbb{I}\{x_j \ne y_j\} \quad \text{(by assumption).}$$

Having obtained $W_1$ and $W_2$, update the $X$-chain by putting $W_1$ as the
$I^{\mathrm{th}}$ coordinate (keeping all other coordinates the same) and update the
$X'$-chain the same way using $W_2$. This is perhaps the most naive way to couple
dependent variables, and is commonly known as "the greedy coupling".

Of course, people know how to make the above construction completely
mathematically precise, and so we prefer to gloss over that aspect. But if someone is
interested, he can look up the recent paper of Marton [HHI; where a more complex 
coupling has been handled in a very mathematically precise manner. 

It is clear from the definition that this coupling indeed gives the Gibbs sampler
chains as the marginals. To see that property P holds, note that for any $i$, the
distribution of $W_1$ as defined above depends only on the current state of the
$X$-chain, irrespective of the status of the $X'$-chain. Similarly, the distribution of
$W_2$ depends only on the current state of the $X'$-chain. Thus, the distribution of
$X_{k+1}$ given $(X_k, X_k')$ depends only on $X_k$. Since the coupling is Markovian, we
can proceed by induction to get a proof of property P for this coupling.

Now assume without loss of generality that $\mathbb{E}f(X) = 0$. The summability
condition (4.4) will be verified later in the proof.

Fix $k \ge 0$, and suppose we used the above scheme to move to the
$(k+1)^{\mathrm{th}}$ stage. Then for any $i$,

$$\mathbb{P}\{X_{k+1,i} \ne X_{k+1,i}' \text{ and } I \ne i \mid X_k, X_k'\} = \Big(1 - \frac{1}{n}\Big)\mathbb{I}\{X_{k,i} \ne X_{k,i}'\},$$

and

$$\mathbb{P}\{X_{k+1,i} \ne X_{k+1,i}' \text{ and } I = i \mid X_k, X_k'\} \le \frac{1}{n}\sum_{j=1}^n a_{ij}\,\mathbb{I}\{X_{k,j} \ne X_{k,j}'\}.$$

Adding up, and taking expectation given $(X_0, X_0')$ on both sides, we get

$$\mathbb{P}\{X_{k+1,i} \ne X_{k+1,i}' \mid X_0, X_0'\}
  \le \Big(1 - \frac{1}{n}\Big)\mathbb{P}\{X_{k,i} \ne X_{k,i}' \mid X_0, X_0'\} + \frac{1}{n}\sum_{j=1}^n a_{ij}\,\mathbb{P}\{X_{k,j} \ne X_{k,j}' \mid X_0, X_0'\}.$$

For each $k$, let $\ell_k = \ell_k(X_0, X_0')$ be the random vector whose
$i^{\mathrm{th}}$ component is $\mathbb{P}\{X_{k,i} \ne X_{k,i}' \mid X_0, X_0'\}$. Let
$B = (1 - \frac{1}{n})I + \frac{1}{n}A$, where $I$ denotes the identity matrix. The above
inequality shows that $\ell_{k+1} \le B\ell_k$ for every $k$, where '$x \le y$' means that
$x_i \le y_i$ for each coordinate $i$. Since everything is nonnegative, we can continue by
induction to see that

$$\ell_k \le B^k\ell_0.$$

Now let $c = (c_1, \ldots, c_n)$ be the vector of constants from (4.5). Note that

$$\|B\|_2 \le 1 - \frac{1}{n} + \frac{\|A\|_2}{n} = 1 - \frac{1 - \|A\|_2}{n},    \qquad (4.6)$$

and hence, by the generalized Lipschitz property of $f$,

$$\sum_{k=0}^{\infty} \big|\mathbb{E}(f(X_k) - f(X_k') \mid X_0, X_0')\big| \le \sum_{k=0}^{\infty} c\cdot\ell_k \le \sum_{k=0}^{\infty} \|B\|_2^k\,\|c\|\,\|\ell_0\| \le \frac{n\,\|c\|\,\|\ell_0\|}{1 - \|A\|_2}.$$

This establishes condition (4.4), and so we can now invoke Lemma 4.2. Produce $X'$ by
starting from $X$ and taking one step according to the Gibbs sampler kernel. Putting
$X_0 = X$ and $X_0' = X'$, we get

$$F(X,X') = \sum_{k=0}^{\infty} \mathbb{E}(f(X_k) - f(X_k') \mid X_0, X_0')$$

by Lemma 4.2. Thus, if $v$ is defined as in (3.1), then

$$v(X) \le \frac{1}{2}\sum_{k=0}^{\infty} \mathbb{E}\big(|(f(X_0) - f(X_0'))(f(X_k) - f(X_k'))| \,\big|\, X_0\big).    \qquad (4.7)$$

Now, by our definition of $\ell_k$ and the property (4.5) of $f$, we have

$$\mathbb{E}\big(|(f(X_0) - f(X_0'))(f(X_k) - f(X_k'))| \,\big|\, X_0, X_0'\big) \le (c\cdot\ell_0)(c\cdot\ell_k).$$

Note that with $(X_0, X_0') = (X, X')$, $X_0$ and $X_0'$ can differ at most at one
coordinate. Suppose they differ at coordinate $i$. Then

$$(c\cdot\ell_0)(c\cdot\ell_k) \le (c\cdot\ell_0)(c\cdot B^k\ell_0) = c_i\,(c\cdot B^ke_i),$$

where $e_i$ denotes the vector whose $i^{\mathrm{th}}$ coordinate is 1 and all other
coordinates are 0. Now, the probability that $X_0$ and $X_0'$ differ at the
$i^{\mathrm{th}}$ coordinate is $\le 1/n$. Hence,

$$\mathbb{E}\big(|(f(X_0) - f(X_0'))(f(X_k) - f(X_k'))| \,\big|\, X_0\big) \le \frac{1}{n}\sum_{i=1}^n c_i\,(c\cdot B^ke_i) = \frac{1}{n}\,c\cdot B^kc \le \frac{\|B\|_2^k\,\|c\|^2}{n}.$$

Thus, by (4.6) and (4.7), we get

$$v(X) \le \frac{\|c\|^2}{2n(1 - \|B\|_2)} \le \frac{\|c\|^2}{2(1 - \|A\|_2)}.$$

The proof is now completed by using Theorem 3.3. □
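
The matrix computation at the end of the proof is easy to reproduce numerically. The
sketch below is an illustration with a hypothetical $3 \times 3$ interdependence matrix
and hypothetical function names. It forms $B = (1 - \frac{1}{n})I + \frac{1}{n}A$, sums
the series $\sum_k c\cdot B^kc$ (truncated), and compares $\frac{1}{2n}$ times that sum
with $\|c\|^2/(2(1 - \|A\|_2))$, where, as an assumption made for simplicity,
$\|A\|_2$ is replaced by its upper bound $\sqrt{\|A\|_1\|A\|_\infty}$.

```python
import math

def norm_1(A):
    """Maximum absolute column sum."""
    return max(sum(abs(A[i][j]) for i in range(len(A))) for j in range(len(A)))

def norm_inf(A):
    """Maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in A)

def norm_2_upper(A):
    """Upper bound on ||A||_2 via ||A||_2 <= sqrt(||A||_1 * ||A||_inf)."""
    return math.sqrt(norm_1(A) * norm_inf(A))

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def series_cBc(A, c, terms=5000):
    """Partial sum of sum_k c . B^k c with B = (1 - 1/n) I + A / n."""
    n = len(A)
    B = [[(1.0 - 1.0 / n) * (i == j) + A[i][j] / n for j in range(n)]
         for i in range(n)]
    total, vec = 0.0, list(c)
    for _ in range(terms):
        total += sum(ci * vi for ci, vi in zip(c, vec))
        vec = mat_vec(B, vec)
    return total

def v_bound(A, c):
    """||c||^2 / (2 (1 - ||A||_2)), with ||A||_2 replaced by its upper bound."""
    return sum(ci * ci for ci in c) / (2.0 * (1.0 - norm_2_upper(A)))

# Hypothetical interdependence matrix (nonnegative, zero diagonal) and constants.
A_example = [[0.0, 0.2, 0.1],
             [0.3, 0.0, 0.2],
             [0.1, 0.1, 0.0]]
c_example = [1.0, 2.0, 0.5]
print(series_cBc(A_example, c_example) / (2 * len(c_example)), v_bound(A_example, c_example))
```

The first printed number corresponds to the bound on $v(X)$ obtained from (4.7), and it
should not exceed the second.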

4.3 Example: Spin systems 

We now specialize to the case $\Omega = [-1, 1]$. Let $\nu$ be a probability measure on
$[-1,1]$. Suppose our original measure $\mu$ on $[-1, 1]^n$ has a density with respect to
$\nu^n$, represented in the Boltzmann form: $Z^{-1}e^{-H(x)}$, where $Z$ is the
normalizing constant.

The measure $\nu$ will usually be either the normalized Lebesgue measure, or the
symmetric distribution on $\{-1,1\}$. The generalized framework will allow us to deal
with both discrete and continuous problems.

The following lemma gives us a simple way of applying Theorem 4.3 in this setting
under the assumption that $H$ can be extended to a twice continuously differentiable
function on $[-1,1]^n$.

Lemma 4.4 For each pair $(i,j)$ with $i \ne j$, define

$$a_{ij} := 4\sup_{x\in[-1,1]^n}\Big|\frac{\partial^2H}{\partial x_i\,\partial x_j}(x)\Big|,$$

and let $a_{ii} = 0$ for each $i$. Then for each $i$ and $x, y \in [-1, 1]^n$, we have

$$d_{TV}\big(\mu_i(\cdot \mid \bar{x}^i), \mu_i(\cdot \mid \bar{y}^i)\big) \le \sum_{j=1}^n a_{ij}\,\mathbb{I}\{x_j \ne y_j\}.$$

Remark. It is obvious that this lemma, combined with Theorem 4.3, gives
concentration inequalities for most of the familiar spin models at sufficiently high
temperature. It covers, for instance, the Ising ferromagnets (in any dimension), the
xy-model, and
even the Ising spin glasses. This lemma is in fact, just a more easily recognizable ver- 
sion (for probabilists) of Proposition 8.8 in Georgii's book ^21, which gives a similar 
condition for Gibbs potentials to satisfy Dobrushin's condition, and gives some more 
examples. 

Example. Consider the Ising model on a graph $G = (V, E)$, where $V = \{1, \ldots, n\}$
and the maximum degree is $r$. Here $\Omega = \{-1, 1\}$ and $\nu$ is the symmetric
probability distribution on $\Omega$. The Hamiltonian at inverse temperature $\beta$ and
external field $h$ is given by

$$H_\beta(x) := -\beta\sum_{(i,j)\in E} x_ix_j - \beta h\sum_{i=1}^n x_i.$$

We shall consider the concentration of the magnetization, defined as

$$m(x) := \frac{1}{n}\sum_{i=1}^n x_i.$$

The Hamiltonian has a natural extension as a function on $\mathbb{R}^n$, and a simple
computation gives

$$a_{ij} = \begin{cases} 4\beta & \text{if } (i,j) \in E, \\ 0 & \text{if } (i,j) \notin E. \end{cases}$$

Thus, the $L^\infty$ and $L^1$ norms of the matrix $A = (a_{ij})$ are both bounded by
$4\beta r$, where recall that $r$ is the maximum degree of $G$. Thus,
$\|A\|_2 \le \sqrt{\|A\|_1\|A\|_\infty} \le 4\beta r$. Again, it is clear that

$$|m(x) - m(y)| \le \frac{2}{n}\sum_{i=1}^n \mathbb{I}\{x_i \ne y_i\}.$$

Thus, if $X = (X_1, \ldots, X_n)$ is a spin configuration drawn from the Gibbs measure at
inverse temperature $\beta < 1/4r$ and external field $h$, and $m(X)$ is its
magnetization, then by Lemma 4.4 and Theorem 4.3 (applied with $c_i = 2/n$, so that
$\sum_i c_i^2 = 4/n$) we have

$$\mathbb{P}\{|m(X) - \mathbb{E}m(X)| \ge t\} \le 2\exp\Big(-\frac{1}{4}\,n(1 - 4\beta r)\,t^2\Big).$$

This gives an explicit concentration bound on the magnetization at sufficiently high
temperature, though it need not necessarily cover the entire high temperature domain.
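
The bound is explicit enough to be used for back-of-the-envelope calculations. The
following sketch, with hypothetical function names, evaluates
$2\exp(-\frac{1}{4}n(1 - 4\beta r)t^2)$ and inverts it to find how many sites are
needed for a prescribed accuracy $t$ and failure probability $\delta$.

```python
import math

def ising_magnetization_bound(n, beta, r, t):
    """2*exp(-n (1 - 4 beta r) t^2 / 4); only meaningful when beta < 1/(4r)."""
    if not beta < 1.0 / (4.0 * r):
        raise ValueError("requires beta < 1/(4r)")
    return 2.0 * math.exp(-0.25 * n * (1.0 - 4.0 * beta * r) * t * t)

def sites_needed(beta, r, t, delta):
    """Smallest n for which the bound at deviation t is at most delta."""
    if not beta < 1.0 / (4.0 * r):
        raise ValueError("requires beta < 1/(4r)")
    rate = 0.25 * (1.0 - 4.0 * beta * r) * t * t
    return math.ceil(math.log(2.0 / delta) / rate)

# Example: graph of maximum degree 4 (e.g. the square lattice), beta = 0.05 < 1/16.
n_min = sites_needed(beta=0.05, r=4, t=0.1, delta=0.01)
print(n_min, ising_magnetization_bound(n_min, 0.05, 4, 0.1))
```

As $\beta$ approaches $1/4r$ the rate $1 - 4\beta r$ degenerates, which reflects the
fact that the method only covers part of the high temperature regime.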



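Proof of Lemma 4.4. The density with respect to $\nu$ of the conditional distribution
$\mu_i(\cdot \mid \bar{x}^i)$ is given by

$$p_i(u \mid \bar{x}^i) = \frac{e^{-H(u,\bar{x}^i)}}{\int e^{-H(v,\bar{x}^i)}\,\nu(dv)},$$

where $(u, \bar{x}^i)$ denotes the vector obtained by substituting the number $u$ as the
$i^{\mathrm{th}}$ coordinate of the vector $x$.

Now fix $i$ and $j$, where $i \ne j$. Direct computation shows that

$$\frac{\partial}{\partial x_j}\,p_i(u \mid \bar{x}^i) = -\Big[\frac{\partial H}{\partial x_j}(u, \bar{x}^i) - \int \frac{\partial H}{\partial x_j}(v, \bar{x}^i)\,p_i(v \mid \bar{x}^i)\,\nu(dv)\Big]\,p_i(u \mid \bar{x}^i),$$

and hence

$$\Big|\frac{\partial}{\partial x_j}\,p_i(u \mid \bar{x}^i)\Big| \le \sup_{v,v'\in[-1,1]}\Big|\frac{\partial H}{\partial x_j}(v, \bar{x}^i) - \frac{\partial H}{\partial x_j}(v', \bar{x}^i)\Big|\,p_i(u \mid \bar{x}^i) \le \frac{a_{ij}}{2}\,p_i(u \mid \bar{x}^i).$$

Thus, for any $A \subseteq [-1, 1]$, we have

$$\Big|\frac{\partial}{\partial x_j}\,\mu_i(A \mid \bar{x}^i)\Big| = \Big|\int_A \frac{\partial}{\partial x_j}\,p_i(u \mid \bar{x}^i)\,\nu(du)\Big| \le \frac{a_{ij}}{2}.$$

Since $|x_j - y_j| \le 2\,\mathbb{I}\{x_j \ne y_j\}$, we can conclude that for any
$x, y \in [-1, 1]^n$,

$$d_{TV}\big(\mu_i(\cdot \mid \bar{x}^i), \mu_i(\cdot \mid \bar{y}^i)\big) = \sup_A\big|\mu_i(A \mid \bar{x}^i) - \mu_i(A \mid \bar{y}^i)\big| \le \sum_{j=1}^n a_{ij}\,\mathbb{I}\{x_j \ne y_j\}.$$

This completes the argument. □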

4.4 Example: Graph colorings 

Suppose $G = (V, E)$ is a graph with $n$ vertices and maximum degree $r$. A $k$-coloring
of $G$ is an assignment of $k$ colors to the vertices of $G$. In other words, it is an
element of $\{1, \ldots, k\}^V$. A proper $k$-coloring is a $k$-coloring of $G$ in which
no two adjacent vertices receive the same color.

Let $X = (X_i, i \in V)$ be a coloring of $G$ chosen uniformly from the set of all proper
$k$-colorings.

A substantial amount of recent effort in theoretical computer science and associ- 
ated probability theory has been devoted to the study of this random variable X and 
Markov chains converging in distribution to X. Most of it centers around temporal 
and spatial mixing and decay of correlations. An early observation of Jerrum j5& and 
Salas & Sokal [Ml, followed by vigorous activity in the last few years, have resulted 
in a spate of sophisticated coupling techniques and improved results. For up-to-date 
references, one can look at the recent articles [T] I39j. 

However, although coupling techniques have revolutionized the analysis of spatial 
and temporal decay of correlations in graph colorings, it is still extremely difficult to 
get concentration inequalities for general functions in these settings. In fact, we can 
see no direct way of getting concentration bounds for as simple a functional as the 
proportion of vertices having a particular color using currently available techniques. 

In this section, we shall use our methods to get concentration inequalities for arbitrary (generalized) Lipschitz functions of randomly chosen proper graph colorings. That is, we shall consider $f : \{1, \ldots, k\}^n \to \mathbb{R}$ satisfying $|f(x) - f(y)| \le \sum_{i=1}^n c_i\, \mathbb{I}\{x_i \ne y_i\}$, and get tail bounds on $f(X)$, where $X$ is our random proper coloring of $G$.

Both Jerrum and Salas & Sokal use the greedy coupling to get decay of correlations under the condition $k > 2r$. This is, in a sense, a "high temperature phase" for proper graph colorings. Since our purpose is only to demonstrate how our technique is applied, we shall stick to this primitive approach.

Proposition 4.5 Suppose $G$ is a finite graph with maximum degree $r$, $X$ is a uniformly chosen proper $k$-coloring of $G$, and $f : \{1, \ldots, k\}^n \to \mathbb{R}$ satisfies $|f(x) - f(y)| \le \sum_{i=1}^n c_i\, \mathbb{I}\{x_i \ne y_i\}$. If $k > 2r$, then for any $t \ge 0$ we have

$$\mathbb{P}\{|f(X) - \mathbb{E}f(X)| \ge t\} \le 2\exp\left( -\frac{\gamma t^2}{2\sum_{i=1}^n c_i^2} \right),$$

where $\gamma = (k - 2r)/(k - r)$.

Example 1. Let $f(x)$ be the proportion of vertices receiving color 1. It is clear that

$$|f(x) - f(y)| \le \frac{1}{n} \sum_{i=1}^n \mathbb{I}\{x_i \ne y_i\}.$$

Also, by symmetry, $\mathbb{E}f(X) = 1/k$, where $X$ is drawn uniformly from the set of all proper $k$-colorings. Using the above result, we get

$$\mathbb{P}\{|f(X) - k^{-1}| \ge t\} \le 2e^{-\gamma n t^2/2},$$

where $\gamma = (k - 2r)/(k - r)$ as in the proposition.

Example 2. Let $f(x)$ be the number of neighbors of vertex 1 which receive color 1. Then we can take $c_i = 1$ if $i$ is a neighbor of 1 and $c_i = 0$ otherwise. Consequently, $\sum_i c_i^2 = r_1$, where $r_1$ is the degree of vertex 1. Thus,

$$\mathbb{P}\{|f(X) - \mathbb{E}f(X)| \ge t\} \le 2e^{-\gamma t^2/2r_1}.$$
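To get a feel for the numbers in the two examples, the bound can be evaluated directly. The following is a minimal sketch, assuming the bound has the form $2\exp(-\gamma t^2 / (2\sum_i c_i^2))$ with $\gamma = (k - 2r)/(k - r)$; the function name `coloring_tail_bound` is ours and purely illustrative.

```python
import math

def coloring_tail_bound(k, r, c, t):
    """Evaluate 2*exp(-gamma*t^2 / (2*sum c_i^2)) with gamma = (k-2r)/(k-r).

    k: number of colors, r: maximum degree, c: Lipschitz coefficients c_i,
    t: deviation. The form of the bound is an assumption stated in the text above.
    """
    if k <= 2 * r:
        raise ValueError("the bound requires k > 2r")
    gamma = (k - 2 * r) / (k - r)
    sum_sq = sum(ci * ci for ci in c)
    return 2.0 * math.exp(-gamma * t * t / (2.0 * sum_sq))

# Example 1: proportion of vertices with color 1, so c_i = 1/n for every vertex.
n, k, r = 10_000, 10, 3
print(coloring_tail_bound(k, r, [1.0 / n] * n, t=0.05))

# Example 2: neighbors of vertex 1 with color 1, so c_i = 1 on r_1 = 3 neighbors.
print(coloring_tail_bound(k, r, [1.0] * 3, t=2.5))
```

In Example 1 the exponent grows linearly in $n$, so the proportion concentrates at scale $n^{-1/2}$; in Example 2 the bound only becomes informative for $t$ of order $\sqrt{r_1}$.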



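Proof of Proposition 4.5. As usual, denote the law of $X_i$ given $\bar{X}^i = \bar{x}^i$ by $p_i(\cdot \mid \bar{x}^i)$. It is easy to see from the definition that $p_i(\cdot \mid \bar{x}^i)$ is the uniform distribution over the set

$$\{1, \ldots, k\} \setminus \{x_j : (i, j) \in E\}.$$

Thus, given two vectors $x$ and $y$ which differ only at a coordinate $j$, where $j$ is adjacent to $i$, it is clear that

$$d_{TV}\big(p_i(\cdot \mid \bar{x}^i),\, p_i(\cdot \mid \bar{y}^i)\big) = \frac{1}{2} \sum_{u=1}^k \big| p_i(u \mid \bar{x}^i) - p_i(u \mid \bar{y}^i) \big| \le \frac{1}{k - r}.$$

Thus, we can take $a_{ij} = 1/(k - r)$ if $(i, j) \in E$ and $a_{ij} = 0$ otherwise, and apply Theorem 4.3 to complete the proof. □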
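The total variation estimate used in this proof can be checked by brute force for small parameters. The sketch below enumerates all colorings of the (at most $r$) neighbors of a vertex $i$, changes one neighbor's color, and records the largest total variation distance between the two resulting uniform conditional laws. The helper names are ours, and the enumeration deliberately ignores whether the neighbors' colors are mutually compatible, which can only enlarge the maximum.

```python
from itertools import product

def conditional_law(k, neighbor_colors):
    """Uniform law on the colors in {1..k} not used by any neighbor."""
    allowed = [u for u in range(1, k + 1) if u not in neighbor_colors]
    return {u: 1.0 / len(allowed) for u in allowed}

def tv(p, q):
    """Total variation distance between two laws given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(u, 0.0) - q.get(u, 0.0)) for u in support)

def worst_case_tv(k, r):
    """Largest TV distance when exactly one of r neighbors changes color."""
    worst = 0.0
    for x in product(range(1, k + 1), repeat=r):
        for j in range(r):
            for new_color in range(1, k + 1):
                if new_color == x[j]:
                    continue
                y = x[:j] + (new_color,) + x[j + 1:]
                worst = max(worst, tv(conditional_law(k, x), conditional_law(k, y)))
    return worst

k, r = 5, 2
print(worst_case_tv(k, r), "vs bound", 1.0 / (k - r))
```

The worst case occurs when all neighbors carry distinct colors, in which case the two conditional laws are uniform on sets of size $k - r$ that differ in one element.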



4.5 Concentration of Haar measures 

Let $G$ be a compact topological group. Then there exists a $G$-valued random variable $X$ with the property that for any $x \in G$, the random variables $xX$, $Xx$ and $X^{-1}$ all have the same distribution as $X$. As mentioned in Section 2.6, where we review the literature on concentration of measure on groups, the existence and uniqueness of the distribution of $X$ (which is called the "Haar measure on $G$") is a classical result (see e.g. Rudin, Theorem 5.14). Our goal in this section is to study the concentration of the Haar measure using the convergence properties of certain kinds of Markov chains which have the Haar measure as their stationary distribution. An original application to the concentration of the spectra of sums of random matrices (related to free probability) will be given in the next section.

We shall not repeat the discussion of the existing literature on concentration of Haar measures, which is done in Section 2.6 of the review chapter. It suffices to say that there is not much work. The only general theorem we could find is a result of Milman & Schechtman [77], which is stated in Section 2.6, where we also discuss available results for special groups like $SO_n$ (the special orthogonal group) and $S_n$ (the symmetric group).

We should, however, mention that we are not the first to analyze measures on groups using Stein's method. Diaconis has an application of Stein's method to the analysis of random walks on groups. More recently, Jason Fulman has applied Stein's method to analyze the Haar measure and the Plancherel measure on $S_n$.

Let us now introduce our setting. Let $Y$ be a $G$-valued random variable having the following properties:

1. The random variable $Y^{-1}$ has the same distribution as $Y$; that is, the law of $Y$ is symmetric.

2. For any $x \in G$, $xYx^{-1}$ has the same distribution as $Y$. In other words, the distribution of $Y$ is uniform on each conjugacy class of $G$.

Any such $Y$ defines a reversible Markov kernel $P$ in a natural way: For any $f : G \to \mathbb{R}$ such that $\mathbb{E}|f(X)| < \infty$, let

$$Pf(x) := \mathbb{E}f(Yx) = \mathbb{E}f(x x^{-1} Y x) = \mathbb{E}f(xY). \tag{4.8}$$

The reversibility of this kernel can be proved as follows. Take $X$ independent of $Y$. Since $yX$ has the same distribution as $X$ for any $y \in G$, the random variables $Y$ and $YX$ are independent. Also, $Y^{-1}$ has the same distribution as $Y$. Hence, the pair $(X, Y)$ has the same distribution as $(YX, Y^{-1})$. Consequently, the pairs $(X, YX)$ and $(YX, Y^{-1}YX) = (YX, X)$ also have the same distribution. In other words, if we let $X' = YX$, then $(X, X')$ is an exchangeable pair. Finally, note that $Pf(x) = \mathbb{E}(f(X') \mid X = x)$.
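On a finite group, reversibility with respect to the Haar (uniform) measure simply means that the transition matrix of the walk $x \mapsto Yx$ is symmetric. The sketch below checks this on $S_3$ for one illustrative choice of $Y$ satisfying properties 1 and 2, namely mass $1/n$ on the identity and $2/n^2$ on each transposition; this choice, and all function names, are our assumptions for the example.

```python
from itertools import permutations

def compose(s, t):
    """Product s*t of permutations stored as tuples: (s*t)(i) = s(t(i))."""
    return tuple(s[t[i]] for i in range(len(t)))

def step_law(n):
    """Illustrative law of Y: 1/n on the identity, 2/n^2 on each transposition."""
    identity = tuple(range(n))
    law = {identity: 1.0 / n}
    for i in range(n):
        for j in range(i + 1, n):
            tau = list(identity)
            tau[i], tau[j] = tau[j], tau[i]
            law[tuple(tau)] = 2.0 / n ** 2
    return law

def transition_matrix(n):
    """Matrix of the kernel P: entry (x, x') is the probability that Yx = x'."""
    group = list(permutations(range(n)))
    index = {g: a for a, g in enumerate(group)}
    P = [[0.0] * len(group) for _ in group]
    for x in group:
        for y, mass in step_law(n).items():
            P[index[x]][index[compose(y, x)]] += mass
    return P

def is_reversible_wrt_uniform(P, tol=1e-12):
    """Detailed balance for the uniform measure is symmetry of P."""
    m = len(P)
    return all(abs(P[a][b] - P[b][a]) <= tol for a in range(m) for b in range(m))

P = transition_matrix(3)
print(is_reversible_wrt_uniform(P))
```

Symmetry of the matrix is exactly the statement that $(X, YX)$ and $(YX, X)$ have the same joint law when $X$ is uniform.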

We shall attempt to study the concentration of $X$ using the properties of this kernel $P$. A version of this problem was considered by Schmuckenschläger, but no practical solution was given. We shall now present a theorem that should suffice in many problems, but with "an extra $\log n$ factor". As mentioned before, concrete examples will be provided later on.

Theorem 4.6 Let $G$, $X$, $Y$ be as above. Let $f : G \to \mathbb{R}$ be a measurable function such that $\mathbb{E}f(X) = 0$. Let $\|f\|_\infty = \sup_{x \in G} |f(x)|$ and

$$\|f\|_Y := \sup_{x \in G} \big[ \mathbb{E}(f(x) - f(Yx))^2 \big]^{1/2}.$$

Let $Y_1, Y_2, \ldots$ be i.i.d. copies of $Y$. Suppose $a$ and $b$ are two positive constants such that $d_{TV}(Y_1 Y_2 \cdots Y_k, X) \le a e^{-bk}$ for every $k$, where $d_{TV}$ is the total variation metric. Let $A$ and $B$ be two numbers such that $\|f\|_\infty \le A$ and $\|f\|_Y \le B$. Let

$$C = \frac{B^2}{2} \left[ \frac{1}{b} \log \frac{4aA}{B} + \frac{2 - e^{-b}}{1 - e^{-b}} \right].$$

Then $\mathrm{Var}(f(X)) \le C$, and for any $t \ge 0$, $\mathbb{P}\{|f(X)| \ge t\} \le 2e^{-t^2/2C}$.
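The constant in the theorem is easy to evaluate once the mixing parameters are known. The following is a minimal sketch, assuming $C = \frac{B^2}{2}\big[\frac{1}{b}\log\frac{4aA}{B} + \frac{2 - e^{-b}}{1 - e^{-b}}\big]$ (the form obtained from the proof given later in this section); the function names are ours.

```python
import math

def haar_concentration_constant(a, b, A, B):
    """C = (B^2/2) * [ log(4aA/B)/b + (2 - e^{-b})/(1 - e^{-b}) ].

    a, b: constants in the mixing bound d_TV <= a*exp(-b*k);
    A, B: upper bounds on the sup-norm and on the Y-norm of f.
    """
    log_term = math.log(4.0 * a * A / B) / b
    geometric_term = (2.0 - math.exp(-b)) / (1.0 - math.exp(-b))
    return 0.5 * B * B * (log_term + geometric_term)

def haar_tail_bound(a, b, A, B, t):
    """Tail bound 2*exp(-t^2 / (2C)) for a mean-zero f."""
    C = haar_concentration_constant(a, b, A, B)
    return 2.0 * math.exp(-t * t / (2.0 * C))

# Illustrative parameters: slower mixing (smaller b) inflates C roughly like 1/b.
for b in (1.0, 0.1, 0.01):
    print(b, haar_concentration_constant(a=10.0, b=b, A=1.0, B=0.5))
```

Since $(2 - e^{-b})/(1 - e^{-b}) \approx 1/b$ for small $b$, the constant behaves like $\frac{B^2}{2b}\big(\log\frac{4aA}{B} + 1\big)$ when the walk mixes slowly.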

Remarks. The rate of decay of the total variation distance for random walks on groups has been a topic of interest ever since the work of Diaconis & Shahshahani, who introduced group representation tools to overcome the deficiencies of the usual spectral gap approach (which gives a wrong answer for the random walk with transpositions on $S_n$). The property that $Y$ is uniformly distributed on conjugacy classes is particularly suited to this line of analysis.

Since that seminal paper, much work has been done; several other methods have been developed by various authors, and the rates have been evaluated in many interesting problems. For a recent survey with an extensive bibliography, we refer to the survey of Diaconis.

To the best of our knowledge, Theorem 4.6 is at present the only result that connects the rates of convergence of random walks on groups, which is a widely explored area, with the concentration of Haar measures, which is not so widely explored.

Before we move on to the proof of Theorem 4.6, it is time for an easy application, which will not give a better-than-existing result, but will merely demonstrate how Theorem 4.6 can be applied. An original application will be given in the next section.

Random permutations. Let $G = S_n$, the group of all permutations of $n$ elements. Then the Haar measure is just the uniform distribution on $G$. Let $X$ be a uniformly distributed random variable on $G$. We define the distribution of our $Y$ by putting mass $1/n$ on the identity permutation and $2/n^2$ on each transposition of two elements. Since the set of transpositions is closed under conjugation and inversion, $Y$ satisfies the required properties 1 and 2 stated above. The kernel $P$ defined by $Y$ as in (4.8) is a familiar object which has been studied by several authors. The study was initiated by Diaconis & Shahshahani, who proved the following result:

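Theorem 4.7 [Diaconis & Shahshahani [33]] Let $Y_1, Y_2, \ldots$ be i.i.d. copies of the random variable $Y$ defined above. Then $d_{TV}(Y_1 \cdots Y_k, X) \le 6n e^{-2k/n}$ for every $k \ge \frac{1}{2} n \log n$.

The above version of the result has been quoted from the survey of Diaconis, where it is written in a slightly different manner. Note that if we substitute $k = \frac{1}{2} n \log n$, the right hand side becomes 6, which is $> 1$. Since the total variation distance never exceeds 1 and the right hand side is decreasing in $k$, the bound holds trivially for all smaller $k$, and hence the condition $k \ge \frac{1}{2} n \log n$ is redundant.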
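For very small $n$ the total variation distance of the random transposition walk can be computed exactly by repeated convolution, which makes it possible to compare it with the bound $6n e^{-2k/n}$. The sketch below does this on $S_4$, assuming the step law puts mass $1/n$ on the identity and $2/n^2$ on each transposition; the function names are ours.

```python
import math
from itertools import permutations

def compose(s, t):
    """Product s*t of permutations stored as tuples: (s*t)(i) = s(t(i))."""
    return tuple(s[t[i]] for i in range(len(t)))

def step_law(n):
    """Law of Y: 1/n on the identity, 2/n^2 on each transposition."""
    identity = tuple(range(n))
    law = {identity: 1.0 / n}
    for i in range(n):
        for j in range(i + 1, n):
            tau = list(identity)
            tau[i], tau[j] = tau[j], tau[i]
            law[tuple(tau)] = 2.0 / n ** 2
    return law

def tv_to_uniform(n, k_max):
    """Exact d_TV(Y_1...Y_k, X) for k = 0, ..., k_max on S_n."""
    group = list(permutations(range(n)))
    law_y = step_law(n)
    mu = {g: 0.0 for g in group}
    mu[tuple(range(n))] = 1.0  # the empty product is the identity
    uniform = 1.0 / len(group)
    distances = []
    for _ in range(k_max + 1):
        distances.append(0.5 * sum(abs(p - uniform) for p in mu.values()))
        nxt = {g: 0.0 for g in group}
        for g, p in mu.items():
            for y, mass in law_y.items():
                nxt[compose(y, g)] += mass * p  # law of Y * (current product)
        mu = nxt
    return distances

n = 4
for k, d in enumerate(tv_to_uniform(n, 12)):
    print(k, round(d, 6), round(min(1.0, 6 * n * math.exp(-2 * k / n)), 6))
```

The exact distances decay faster than the bound, which is expected: the bound is designed to be sharp in its dependence on $n$ near $k = \frac{1}{2} n \log n$, not for a fixed small group.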

Using the Diaconis-Shahshahani result, we get $a = 6n$ and $b = 2/n$ in Theorem 4.6, and hence the following result:

Proposition 4.8 Let $G = S_n$, where $n > 2$, and let $X$, $Y$ be as above. Let $f : G \to \mathbb{R}$ be a function such that $\mathbb{E}f(X) = 0$. As in Theorem 4.6, let $\|f\|_Y = \max_{\sigma \in G} [\mathbb{E}(f(\sigma) - f(Y\sigma))^2]^{1/2}$. Let $A$ and $B$ be two numbers such that $\|f\|_\infty \le A$ and $\|f\|_Y \le B$. Let

$$C = \frac{nB^2}{4} \left[ \log \frac{24nA}{B} + \frac{(2/n)(2 - e^{-2/n})}{1 - e^{-2/n}} \right].$$

Then $\mathrm{Var}(f(X)) \le C$, and for any $t \ge 0$, we have $\mathbb{P}\{|f(X)| \ge t\} \le 2e^{-t^2/2C}$.

For a simple application of this proposition, consider the descent function on $S_n$, defined as

$$D(\sigma) := \sum_{i=1}^{n-1} \mathbb{I}\{\sigma(i) > \sigma(i+1)\}.$$

The number of descents of a permutation is an interesting quantity from statistical and combinatorial points of view. Descents have been studied extensively by Foata and Schützenberger, Knuth, and in the unpublished notes of Diaconis & Pitman. Fulman has applied Stein's method to the study of descents.

Let $X$ be uniformly distributed on $S_n$. Clearly, $\|D(X) - \mathbb{E}D(X)\|_\infty \le n$. Now, for any $x \in G$ and any transposition $y$, $|D(x) - D(yx)| \le 4$. Putting $A = n$ and $B = 4$ in Proposition 4.8, and assuming $n \ge 10$, we get $C \le 4n(2\log n + 3.1)$. Thus, for $n \ge 10$, we get $\mathrm{Var}(D(X)) \le 4n(2\log n + 3.1)$ and for any $t \ge 0$,

$$\mathbb{P}\{|D(X) - \mathbb{E}D(X)| \ge t\} \le 2\exp\left( -\frac{t^2}{8n(2\log n + 3.1)} \right).$$
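The arithmetic behind the constant $3.1$, and the size of the gap between the variance bound and the true variance, can be checked numerically. The sketch below assumes the form of $C$ in Proposition 4.8 given by substituting $a = 6n$, $b = 2/n$ into Theorem 4.6, and computes the exact variance of the number of descents by enumeration for a small $n$; the function names are ours.

```python
import math
from itertools import permutations

def descents(sigma):
    """Number of positions i with sigma(i) > sigma(i+1)."""
    return sum(1 for i in range(len(sigma) - 1) if sigma[i] > sigma[i + 1])

def descent_variance(n):
    """Exact variance of the number of descents of a uniform permutation of n."""
    values = [descents(s) for s in permutations(range(n))]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def prop48_constant(n, A, B):
    """C = (n B^2 / 4) * [ log(24 n A / B) + (2/n)(2 - e^{-2/n})/(1 - e^{-2/n}) ]."""
    x = 2.0 / n
    return 0.25 * n * B * B * (math.log(24.0 * n * A / B)
                               + x * (2.0 - math.exp(-x)) / (1.0 - math.exp(-x)))

# With A = n and B = 4 the constant stays below 4n(2 log n + 3.1) once n >= 10.
for n in (10, 100, 1000):
    print(n, prop48_constant(n, n, 4.0), 4 * n * (2 * math.log(n) + 3.1))

# Exact variance versus the bound for a small case.
print(descent_variance(6), prop48_constant(6, 6, 4.0))
```

The enumeration shows how conservative the variance bound is for this particular statistic: the true variance grows linearly in $n$ with a small constant, while the bound carries both a larger constant and the extra logarithm.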



The bounds are clearly off by a factor of $\log n$, but that will be a perpetual inconvenience of using Theorem 4.6. However, it is interesting to note that Maurey's theorem (stated in the review chapter) gives the bound $\mathbb{P}\{|D(X) - \mathbb{E}D(X)| \ge t\} \le 2e^{-t^2/256n}$ in this problem, which, though technically "better" than our bound, is going to be worse in all practical situations.

That apart, there is a more serious reason why Proposition 4.8 may give better results in some problems. Maurey's result essentially uses $\|f\|_{\mathrm{Lip}}$ instead of $\|f\|_Y$, where $\|f\|_{\mathrm{Lip}} = \max\{|f(\sigma) - f(\tau\sigma)| : \sigma \in S_n,\ \tau \text{ is a transposition}\}$. Clearly, $\|f\|_Y \le \|f\|_{\mathrm{Lip}}$, and the difference may be significant in some situations.
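The two norms can be compared by direct enumeration on a small symmetric group. The sketch below computes both for the descent function on $S_4$, assuming the step law of $Y$ puts mass $2/n^2$ on each transposition (the identity contributes nothing to either norm); the function names are ours.

```python
import math
from itertools import permutations

def compose(s, t):
    """Product s*t of permutations stored as tuples: (s*t)(i) = s(t(i))."""
    return tuple(s[t[i]] for i in range(len(t)))

def transpositions(n):
    """All transpositions of {0, ..., n-1} as tuples."""
    result = []
    for i in range(n):
        for j in range(i + 1, n):
            tau = list(range(n))
            tau[i], tau[j] = tau[j], tau[i]
            result.append(tuple(tau))
    return result

def descents(sigma):
    return sum(1 for i in range(len(sigma) - 1) if sigma[i] > sigma[i + 1])

def norm_y(f, n):
    """max over sigma of sqrt(E (f(sigma) - f(Y sigma))^2)."""
    taus = transpositions(n)
    weight = 2.0 / n ** 2
    worst = 0.0
    for s in permutations(range(n)):
        second_moment = sum(weight * (f(s) - f(compose(t, s))) ** 2 for t in taus)
        worst = max(worst, second_moment)
    return math.sqrt(worst)

def norm_lip(f, n):
    """max over sigma and transpositions tau of |f(sigma) - f(tau sigma)|."""
    taus = transpositions(n)
    return max(abs(f(s) - f(compose(t, s)))
               for s in permutations(range(n)) for t in taus)

print(norm_y(descents, 4), norm_lip(descents, 4))
```

The averaged norm is smaller whenever the largest jumps of $f$ are produced by only a few transpositions, which is the situation in which Proposition 4.8 has an advantage.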

Talagrand's result ([107], Theorem 5.1), on the other hand, will probably dominate Proposition 4.8 most of the time, though it is not clear whether Talagrand's method can be used to derive a result based on something like $\|f\|_Y$.

We now move on to prove the central result of this section. 

Proof of Theorem 4.6. Let $X' = YX$. As observed before, $(X, X')$ is an exchangeable pair. Recall that we defined $Pf(x) = \mathbb{E}f(Yx)$. By Lemma 4.1,

$$v(x) \le \frac{1}{2} \sum_{k=0}^\infty \mathbb{E}\big| (f(x) - f(Yx))(P^k f(x) - P^k f(Yx)) \big|, \tag{4.9}$$

where $v(x)$ is the usual quantity in our theory, as defined in (3.1). The criterion (4.3) required for Lemma 4.1 holds, since for any $z \in G$,

$$|P^k f(z)| = |P^k f(z) - \mathbb{E}f(X)| = |\mathbb{E}f(Y_1 \cdots Y_k z) - \mathbb{E}f(Xz)| \le 2\|f\|_\infty\, d_{TV}(Y_1 \cdots Y_k, X) \le 2\|f\|_\infty\, a e^{-bk}.$$

This observation also gives

$$\mathbb{E}\big| (f(x) - f(Yx))(P^k f(x) - P^k f(Yx)) \big| \le 4\|f\|_\infty\, a e^{-bk}\, \mathbb{E}|f(x) - f(Yx)| \le 4\|f\|_\infty\, a e^{-bk}\, \|f\|_Y. \tag{4.10}$$

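Now recall assumption 2: for any $y \in G$, $y^{-1}Yy$ has the same distribution as $Y$. Thus, for any $x, y \in G$,

$$Pf(yx) = \mathbb{E}f(Yyx) = \mathbb{E}f(y y^{-1} Y y x) = \mathbb{E}f(yYx).$$

So, if we let $Y'$ be an independent copy of $Y$, then

$$\mathbb{E}(Pf(x) - Pf(Yx))^2 = \mathbb{E}\big( \mathbb{E}(f(Y'x) - f(YY'x) \mid Y)^2 \big) \le \mathbb{E}(f(Y'x) - f(YY'x))^2 \le \sup_{y' \in G} \mathbb{E}(f(y'x) - f(Yy'x))^2 = \|f\|_Y^2.$$

Thus, $\|Pf\|_Y \le \|f\|_Y$. Continuing by induction, we get $\|P^k f\|_Y \le \|f\|_Y$. Thus, by the Cauchy-Schwarz inequality,

$$\mathbb{E}\big| (f(x) - f(Yx))(P^k f(x) - P^k f(Yx)) \big| \le \big( \mathbb{E}(f(x) - f(Yx))^2 \big)^{1/2} \big( \mathbb{E}(P^k f(x) - P^k f(Yx))^2 \big)^{1/2} \le \|f\|_Y \|P^k f\|_Y \le \|f\|_Y^2. \tag{4.11}$$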

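Using (4.10) and (4.11) in (4.9), we get

$$v(x) \le \frac{1}{2} \sum_{k=0}^\infty \min\{\|f\|_Y^2,\ 4a\|f\|_\infty \|f\|_Y e^{-bk}\} \le \frac{1}{2} \sum_{k=0}^\infty \min\{B^2,\ 4aAB e^{-bk}\} = \frac{B^2}{2} \sum_{k=0}^\infty \min\{1,\ 4aAB^{-1} e^{-bk}\}. \tag{4.12}$$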

We shall now compute a bound on the above sum. For ease of notation let $\beta = 4aAB^{-1}$, and let $\gamma = b^{-1} \log \beta$. If $\beta < 1$, the sum is just a geometric series which is easy to evaluate. Now assume $\beta \ge 1$. Then $\gamma$ is nonnegative. Now, an easy verification shows that $\beta e^{-b\gamma} = 1$, and $1 > \beta e^{-bk}$ if and only if $k > \gamma$. Hence,

$$\sum_{k=0}^\infty \min\{1, \beta e^{-bk}\} \le \gamma + 1 + \sum_{k > \gamma} \beta e^{-bk} \le \gamma + 1 + \beta e^{-b\gamma} \sum_{r=0}^\infty e^{-br} = \gamma + 1 + \frac{1}{1 - e^{-b}}.$$

To finish, we substitute this bound in (4.12), and use Theorems 3.2 and 3.3 from Chapter 3. □
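The elementary bound on the truncated geometric sum is easy to confirm numerically. The sketch below compares a long partial sum of $\sum_k \min\{1, \beta e^{-bk}\}$ with $\gamma + 1 + 1/(1 - e^{-b})$, where $\gamma = b^{-1}\log\beta$, for a few values of $\beta \ge 1$ and $b > 0$; the function names and the chosen parameter values are ours.

```python
import math

def truncated_sum(beta, b, terms=10_000):
    """Partial sum of min(1, beta*exp(-b*k)) over k = 0, ..., terms-1."""
    return sum(min(1.0, beta * math.exp(-b * k)) for k in range(terms))

def closed_form_bound(beta, b):
    """gamma + 1 + 1/(1 - e^{-b}) with gamma = log(beta)/b, valid for beta >= 1."""
    gamma = math.log(beta) / b
    return gamma + 1.0 + 1.0 / (1.0 - math.exp(-b))

for beta, b in [(1.0, 3.0), (600.0, 0.2), (1.0e4, 0.01)]:
    print(beta, b, truncated_sum(beta, b), closed_form_bound(beta, b))
```

The two quantities are close when $b$ is small, so little is lost in this step of the proof; the $\log n$ factor in the applications comes from $\gamma$ itself, not from slack in the bound.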



4.6 Application to random matrices and free probability

Let $M$ be an $n \times n$ complex hermitian (i.e. self-adjoint) matrix. We shall fix the following notation for the rest of this section.

• The empirical spectral measure of $M$ is the probability measure on $\mathbb{R}$, denoted by $\mu_M$, which puts mass $1/n$ on each eigenvalue of $M$, repeated by multiplicities.

• The empirical distribution function of $M$, denoted by $F_M$, is the distribution function corresponding to the empirical spectral measure.

• Let $\Lambda$ be a diagonal matrix whose diagonal elements are the eigenvalues of $M$. Then any hermitian matrix which has the same spectrum as $M$ can be written as $U \Lambda U^*$, where $U$ is a unitary matrix. Thus, a uniform measure on the set of all such matrices is naturally induced by the Haar measure on the group of all unitary matrices of order $n$, which we shall denote by $\mathcal{U}_n$. This probability measure will be denoted by $\rho_M$.

A fundamental observation of Voiculescu [114] is the following: If $M$ and $N$ are two hermitian matrices of order $n$ with empirical measures $\mu_M$ and $\mu_N$, and $n$ is large, then the empirical measure $\mu_{M+N}$ of $M + N$ is "approximately determined" by $\mu_M$ and $\mu_N$ (irrespective of $M$ and $N$) for "most choices of $M$ and $N$". The meaning of the phrases in quotes is made precise by the following result:

Theorem 4.9 [Voiculescu's result, as stated in Biane [3]] For each positive integer $n$, let $A_n$ and $B_n$ be two hermitian matrices, whose eigenvalues are bounded uniformly in $n$. Let $\mu_1$ and $\mu_2$ be two probability measures with compact supports on $\mathbb{R}$, such that $\mu_{A_n} \to \mu_1$ and $\mu_{B_n} \to \mu_2$ weakly as $n \to \infty$. Then there exists a probability measure, depending only on $\mu_1$ and $\mu_2$, denoted by $\mu_1 \boxplus \mu_2$, such that $\mu_{A'_n + B'_n} \to \mu_1 \boxplus \mu_2$, whenever $A'_n$ and $B'_n$ are random matrices chosen independently with distributions $\rho_{A_n}$ and $\rho_{B_n}$.

Voiculescu [114] proved this striking result in a limiting form, using the method of moments and some concentration results of Szarek [106] and Gromov & Milman [47]. The theorem was used by Voiculescu to establish a connection between free probability theory and random matrices, which resulted in an explosion of activity in the free probability literature. Another proof of Voiculescu's observation, using Stieltjes transforms, was given by Pastur and Vasilchuk [83]. A good review of the literature, as well as a good exposition, is given in Biane [3]. Another useful reference is the lecture notes by Voiculescu [115].

To the best of our knowledge, all the existing results are limiting statements; quantitative bounds on the concentration of $F_{M+N}$ given $F_M$ and $F_N$ for finite $n$ are not available in the literature. We shall now state and prove one such result, as an application of the machinery developed in the previous section.

Theorem 4.10 Let $\Lambda_1$ and $\Lambda_2$ be two $n \times n$ real diagonal matrices. Let $U$ and $V$ be independent Haar distributed random elements of $\mathcal{U}_n$, the group of all unitary matrices of order $n$. Let

$$H = U \Lambda_1 U^* + V \Lambda_2 V^*,$$

and let $F_H$ be the empirical distribution function of $H$. Then, for every $x \in \mathbb{R}$, $\mathrm{Var}(F_H(x)) \le \kappa n^{-1} \log n$, where $\kappa$ is a universal constant not depending on $n$, $\Lambda_1$, $\Lambda_2$ or $x$. Moreover, we also have the concentration inequality

$$\mathbb{P}\{|F_H(x) - \mathbb{E}(F_H(x))| \ge t\} \le 2\exp\left( -\frac{n t^2}{2\kappa \log n} \right)$$

for every $t \ge 0$, where $\kappa$ is the same as in the variance bound.

To prove Theorem 4.10, we first need to establish a theorem about the concentration of the Haar measure on $\mathcal{U}_n$. Existing results of the type discussed in Section 2.6 cannot give concentration bounds for $F_H$, since they are based on the Hilbert-Schmidt distance, which is too crude for such a delicate problem. Instead, we shall try to find the concentration of the Haar measure with respect to the rank distance, defined as $d(M, N) := \mathrm{rank}(M - N)$. That this is indeed a metric follows from the subadditivity of rank. The empirical distribution function is well-behaved with respect to this metric, as shown by the following lemma of Bai:

Lemma 4.11 [Bai, Lemma 2.2] Let $M$ and $N$ be two $n \times n$ hermitian matrices, with empirical distribution functions $F_M$ and $F_N$. Then

$$\|F_M - F_N\|_\infty \le \frac{1}{n}\, \mathrm{rank}(M - N).$$

This lemma is an easy consequence of the interlacing inequalities for eigenvalues of hermitian matrices. It seems possible that this already existed in the literature before Bai, but we could not find any reference.
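The rank inequality can be verified in exact arithmetic for real symmetric integer matrices, which are a special case of hermitian matrices. The sketch below is not taken from the source: it counts eigenvalues below a point $x$ through the signs of the pivots of $M - xI$ (Sylvester's law of inertia), and restricts $x$ to half-integers, which can never be eigenvalues of an integer matrix or of its leading principal submatrices, so no pivot is ever zero. All names are ours.

```python
from fractions import Fraction

def count_eigenvalues_below(M, x):
    """Number of eigenvalues of the symmetric integer matrix M that are < x.

    Counts negative pivots in Gaussian elimination (without pivoting) on
    M - x*I, using exact rational arithmetic. x should be a non-integer rational.
    """
    n = len(M)
    A = [[Fraction(M[i][j]) - (x if i == j else 0) for j in range(n)]
         for i in range(n)]
    negatives = 0
    for p in range(n):
        pivot = A[p][p]
        if pivot == 0:
            raise ZeroDivisionError("zero pivot: choose a non-integer rational x")
        if pivot < 0:
            negatives += 1
        for i in range(p + 1, n):
            factor = A[i][p] / pivot
            for j in range(p + 1, n):
                A[i][j] -= factor * A[p][j]
    return negatives

def sup_distance(M, N, grid):
    """max over x in grid of |F_M(x) - F_N(x)|, as an exact fraction."""
    n = len(M)
    return max(abs(Fraction(count_eigenvalues_below(M, x)
                            - count_eigenvalues_below(N, x), n)) for x in grid)

# A symmetric matrix and a rank-one perturbation of it.
M = [[2, -1, 0, 3], [-1, 0, 4, 1], [0, 4, -3, 2], [3, 1, 2, 1]]
v = [1, -2, 0, 1]
N = [[M[i][j] + v[i] * v[j] for j in range(4)] for i in range(4)]
grid = [Fraction(2 * m + 1, 2) for m in range(-20, 20)]
print(sup_distance(M, N, grid))  # the lemma predicts a value of at most 1/4
```

Evaluating on a grid only gives a lower bound for the supremum norm, but since both distribution functions are step functions with jumps at algebraic integers, a half-integer grid already separates many of the jumps in small examples.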

We shall use Theorem 4.6 to find the concentration of the Haar measure on $\mathcal{U}_n$ with respect to the rank metric. For that purpose, we need a random walk which takes "small steps" with respect to that metric.

Let $G = \mathcal{U}_n$ and $X$ be a Haar-distributed random variable on $\mathcal{U}_n$. We define the $Y$ required for generating the random walk for Theorem 4.6 as follows: Let $Y = I - (1 - e^{i\varphi}) u u^*$, where $u$ is drawn uniformly from the unit sphere in $\mathbb{C}^n$, and $\varphi$ is drawn independently from the distribution on $[0, 2\pi)$ with density proportional to $(\sin(\varphi/2))^{n-1}$. Multiplication by $Y$ represents reflection across a randomly chosen subspace. It is easy to verify that $Y \in \mathcal{U}_n$. Now, for any $U \in \mathcal{U}_n$,

$$U Y U^* = I - (1 - e^{i\varphi})(Uu)(Uu)^*,$$

and $Uu$ is again uniformly distributed over the unit sphere in $\mathbb{C}^n$. Also, $Y^{-1} = Y^* = I - (1 - e^{-i\varphi}) u u^* = I - (1 - e^{i(2\pi - \varphi)}) u u^*$ has the same distribution as $Y$, since $2\pi - \varphi$ has the same distribution as $\varphi$. Hence $Y$ satisfies properties 1 and 2 from Section 4.5. Following a sketch of Diaconis & Shahshahani [33], Ursula Porod [88] proved the following result:

Theorem 4.12 [Porod [88]] Let $X$, $Y$ be as above. Let $Y_1, Y_2, \ldots$ be i.i.d. copies of $Y$. There exist universal constants $\alpha$, $\beta$, $c_0$ such that whenever $n \ge 16$ and $k \ge \frac{1}{2} n \log n + c_0 n$, we have

$$d_{TV}(Y_1 \cdots Y_k, X) \le \alpha n^{\beta/2} e^{-\beta k/n}. \tag{4.13}$$

Substituting $k = \frac{1}{2} n \log n + c_0 n$, we get $\alpha e^{-\beta c_0}$ on the right hand side. Thus by suitably increasing $\alpha$ such that $\alpha e^{-\beta c_0} \ge 1$, we can drop the condition that $k \ge \frac{1}{2} n \log n + c_0 n$. Combining Porod's theorem with Theorem 4.6, we get the following result about concentration of the Haar measure on $\mathcal{U}_n$.

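Proposition 4.13 Let $G = \mathcal{U}_n$ and $X$, $Y$ be as above, with $n \ge 16$. Let $f : \mathcal{U}_n \to \mathbb{R}$ be a function such that $\mathbb{E}f(X) = 0$. Let $\|f\|_Y = \sup_{U \in \mathcal{U}_n} [\mathbb{E}(f(U) - f(YU))^2]^{1/2}$. Let $A$ and $B$ be constants such that $\|f\|_\infty \le A$ and $\|f\|_Y \le B$. Let

$$C = \frac{nB^2}{2\beta} \left[ \log \frac{4\alpha n^{\beta/2} A}{B} + \frac{(\beta/n)(2 - e^{-\beta/n})}{1 - e^{-\beta/n}} \right],$$

where $\alpha$ and $\beta$ are as in (4.13). Then $\mathrm{Var}(f(X)) \le C$, and for any $t \ge 0$, we have $\mathbb{P}\{|f(X)| \ge t\} \le 2e^{-t^2/2C}$.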

We are now ready to finish the proof of Theorem 4.10.

Proof of Theorem 4.10. We shall carry on with all the notation we have already defined in this section. The matrix $V^* H V = V^* U \Lambda_1 U^* V + \Lambda_2$ has the same spectrum as $H$. Also, $V^* U$ is again Haar distributed. Hence, we can write, without loss of generality,

$$H = X \Lambda_1 X^* + \Lambda_2,$$

where $X$ follows the Haar distribution on $\mathcal{U}_n$. Now let

$$H' = (YX) \Lambda_1 (YX)^* + \Lambda_2.$$

Recall that $Y = I - (1 - e^{i\varphi}) u u^*$, where $u$ is drawn from the uniform distribution on the unit sphere in $\mathbb{C}^n$, and $\varphi$ is drawn independently from the distribution on $[0, 2\pi)$ with density proportional to $(\sin(\varphi/2))^{n-1}$. Let $\delta = 1 - e^{i\varphi}$, so that $Y^* = I - \bar{\delta} u u^*$. Then, writing $K = X \Lambda_1 X^*$ for brevity,

$$H - H' = K - (I - \delta u u^*) K (I - \bar{\delta} u u^*) = \delta u u^* K + \bar{\delta} K u u^* - |\delta|^2 u u^* K u u^*.$$

The three summands are all of rank at most 1. It follows by the subadditivity of rank that $\mathrm{rank}(H - H') \le 3$. Thus by Lemma 4.11 we get

$$\|F_H - F_{H'}\|_\infty \le \frac{3}{n}. \tag{4.14}$$

Now fix a point $x \in \mathbb{R}$, and let $f : \mathcal{U}_n \to \mathbb{R}$ be the map which takes $X$ to $F_H(x)$. Then by (4.14), we have

$$|f(X) - f(YX)| \le \frac{3}{n} \quad \text{for all possible values of } X \text{ and } Y.$$

Thus, $\|f\|_Y \le 3/n$. Also, $\|f\|_\infty \le 1$. Both bounds remain valid for the centered function $f - \mathbb{E}f(X)$, to which Proposition 4.13 applies with $A = 1$ and $B = 3/n$. Thus, in Proposition 4.13 we get $C \le (\kappa \log n + c)/n$ for some universal constants $\kappa$ and $c$. By choosing $\kappa$ large enough, we can drop the assumption that $n \ge 16$ and also put $c = 0$. This completes the proof. □